
31. Achieve constant memory in hydra-node

· 2 min read
Sasha Bogicevic
Senior Software Engineer

Status

Accepted

Context

When testing hydra-node operation under heavy or increased load, we noticed that memory consumption is far from ideal. So far we have not paid much attention to performance, but the time has come to reduce the memory footprint of a running hydra-node.

There are some quick wins to be had here, since the projections used to serve in-memory data rely on an ordinary Haskell list as their data structure. As a first optimisation, we should stream the data to keep memory bounded.

It is also not necessary to output the whole history of messages by default; we should only do that when clients explicitly request it. Internally, our ServerOutput type could be remapped to StateChanged, since the two are almost identical. Any new information must still be streamed to clients automatically.
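
The remapping idea can be sketched as a total function between the two types. The types and constructor names below are hypothetical, heavily simplified stand-ins for the real hydra-node types, used only to illustrate the shape of the change:

```haskell
{-# LANGUAGE LambdaCase #-}

-- Hypothetical, simplified stand-ins for the real hydra-node types;
-- constructor names are illustrative only.
data ServerOutput
  = PeerConnected String
  | SnapshotConfirmed Int
  deriving (Eq, Show)

data StateChanged
  = PeerConnectedState String    -- a constructor added to cover ServerOutput
  | SnapshotConfirmedState Int
  deriving (Eq, Show)

-- A total mapping from ServerOutput to StateChanged: once every output
-- has a counterpart, the API can serve StateChanged events directly.
toStateChanged :: ServerOutput -> StateChanged
toStateChanged = \case
  PeerConnected peer   -> PeerConnectedState peer
  SnapshotConfirmed sn -> SnapshotConfirmedState sn
```

Because the mapping is total, the compiler enforces that no ServerOutput case is left without a StateChanged counterpart.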

Decision

  • Re-map ServerOutput to StateChanged by adding any missing constructors to StateChanged (e.g. PeerConnected).
  • Output new client messages on new state changes instead of using ClientEffect.
  • Use StateChanged in all projections we serve from the API (re-using eventId as the sequence number).
  • Make hydra-node output the history of messages only on demand (a breaking change, to be communicated in the changelog).
  • Use the conduit library to achieve constant memory by streaming the data in our projections.
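
The last point can be illustrated with a minimal conduit sketch (not the actual hydra-node code): events are pulled through the pipeline one at a time and folded strictly, so the full history is never resident in memory at once. The function name and the use of plain Ints for events are assumptions made for brevity:

```haskell
import Data.Conduit (runConduit, (.|))
import qualified Data.Conduit.List as CL

-- Count the events after a client-supplied sequence number, streaming
-- them one at a time instead of materializing the whole list.
countFrom :: Monad m => Int -> [Int] -> m Int
countFrom seqNo events =
  runConduit $
    CL.sourceList events
      .| CL.filter (> seqNo)        -- serve only events after seqNo
      .| CL.fold (\n _ -> n + 1) 0  -- strict left fold, constant memory
```

The same pipeline shape applies when serving a projection over the API: replace the fold with a sink that emits each event to the client as it arrives.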

Consequences

This should lead to much better performance of hydra-node in terms of memory used by the running process. It should also be confirmed by running the relevant benchmarks and by a test (even a manual one or a script) asserting that memory consumption is actually reduced.

32. Network layer properties, implementation using etcd

· 4 min read
Sebastian Nagel
Software Engineering Lead

Status

Accepted

Context

  • The communication primitive of broadcast is introduced in ADR 6. The original protocol design in the paper and that ADR implicitly assume a reliable broadcast.

  • ADR 27 further specifies that the hydra-node should be tolerant to the fail-recovery failure model, and takes the decision to implement a reliable broadcast by persisting outgoing messages and using a vector clock and heartbeat mechanism, over a dumb transport layer.

    • The current transport layer in use is a simple FireForget protocol over TCP connections implemented using ouroboros-framework.
    • ADR 17 proposed to use UDP instead.
    • Either this design or its implementation was discovered to be wrong, because this system did not survive fault injection tests with moderate packet drops.
  • This research paper explores various consensus protocols used in the blockchain space and reminds us of the correspondence between consensus and broadcasts:

    the form of consensus relevant for blockchain is technically known as atomic broadcast

    It also states that (back then):

    The most important and most prominent way to implement atomic broadcast (i.e., consensus) in distributed systems prone to t < n/2 node crashes is the family of protocols known today as Paxos and Viewstamped Replication (VSR).

Decision

  • We realize that, given how the off-chain protocol is specified in the paper, the broadcast abstraction required from the Network interface is a so-called uniform reliable broadcast. Hence, any implementation of Network needs to satisfy the following properties:

    1. Validity: If a correct process p broadcasts a message m, then p eventually delivers m.
    2. No duplication: No message is delivered more than once.
    3. No creation: If a process delivers a message m with sender s, then m was previously broadcast by process s.
    4. Agreement: If a message m is delivered by some correct process, then m is eventually delivered by every correct process.

    See also Module 3.3 in Introduction to Reliable and Secure Distributed Programming by Cachin et al., or Self-stabilizing Uniform Reliable Broadcast by Oskar Lundström.

  • Use etcd as a proxy to achieve reliable broadcast via its Raft consensus.

    • Raft is an evolution of Paxos and similar to VSR
    • Over-satisfies requirements as it provides "Uniform total order" (satisfies atomic broadcast properties)
    • Each hydra-node runs an etcd instance to realize its Network interface
    • See the following architecture diagram, which also contains some notes on Network interface properties.

  • We supersede the decisions of ADR 17 and ADR 27 on how to implement Network with the current ADR.
    • Drop the existing implementation using the Reliability layer for now.
    • This could be revisited, as in theory it would satisfy the properties if implemented correctly.
    • Note that uniform reliable broadcast means only delivering a message once it has been seen by everyone, which is not what we had implemented.
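
The mechanism this decision relies on can be sketched as a toy, pure model (not the real implementation): every broadcast appends to one totally ordered log in which etcd assigns a monotonically increasing revision, and each node delivers everything after its last-seen revision. All names here are illustrative:

```haskell
-- Toy model of the etcd-backed broadcast: a single shared log with
-- monotonically increasing revisions, as assigned by etcd's Raft log.
type Revision = Int

newtype Log = Log [(Revision, String)]  -- entries in ascending revision order

-- Broadcasting appends to the shared log with the next revision.
appendMsg :: String -> Log -> Log
appendMsg msg (Log entries) =
  let next = if null entries then 1 else fst (last entries) + 1
   in Log (entries ++ [(next, msg)])

-- Deliver entries newer than the last-seen revision and advance it.
-- After a crash, replaying from the persisted revision yields the same
-- messages; the shared total order gives Agreement and No duplication.
deliverFrom :: Revision -> Log -> ([String], Revision)
deliverFrom seen (Log entries) =
  let new      = filter ((> seen) . fst) entries
      lastSeen = if null new then seen else fst (last new)
   in (map snd new, lastSeen)
```

Because every node reads the same log in the same order, this model over-satisfies the uniform reliable broadcast properties in the same way etcd's total order does.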

Consequences

  • Crash tolerance of fewer than n/2 failing nodes

  • Using etcd as-is adds a runtime dependency on that binary.

    • Docker image users should not see any difference in UX.
  • Introspectability of the network, as the etcd cluster is queryable, could improve the debugging experience.

  • Persisted state for networking changes: instead of acks, there will be the etcd Write-Ahead Log (WAL) and a last-seen revision.

  • Can keep the same user experience for configuration

    • Full, static topology, listing everyone as --peer
    • Simpler configuration via peer discovery possible
  • PeerConnected semantics needs to change to an overall HydraNetworkConnected

    • We can only submit / receive messages when connected to the majority cluster
  • etcd has a few features out-of-the-box we could lean into, e.g.

    • use TLS to secure peer connections
    • separate advertised and binding addresses
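
The HydraNetworkConnected semantics above boil down to a majority check. The helper below is hypothetical (its name and signature are not part of hydra-node) and only illustrates the rule that a node can make progress while it reaches a strict majority of the etcd cluster, itself included:

```haskell
-- A node is "connected" in the HydraNetworkConnected sense only while it
-- can reach a strict majority of the clusterSize-node etcd cluster.
hasMajority :: Int -> Int -> Bool
hasMajority clusterSize reachable = 2 * reachable > clusterSize
```

For a 3-node cluster, reaching 2 nodes suffices; for a 4-node cluster, reaching 2 does not, which matches Raft's majority-quorum rule.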