32. Network layer properties, implementation using etcd
Status
Accepted
Context
-
The communication primitive of
broadcastis introduced in ADR 6. The original protocol design in the paper and that ADR implicitly assume a reliable broadcast. -
ADR 27 further specifies that the
hydra-nodeshould be tolerant to the fail-recovery failure model, and takes the decision to implement a reliable broadcast by persisting outgoing messages and using a vector clock and heartbeat mechanism, over a dumb transport layer.- The current transport layer in use is a simple FireForget protocol over TCP connections implemented using
ouroboros-framework. - ADR 17 proposed to use UDP instead
- Either this design or its implementation was discovered to be wrong, because this system did not survive fault injection tests with moderate package drops.
- The current transport layer in use is a simple FireForget protocol over TCP connections implemented using
-
This research paper explored various consensus protocols used in blockchain space and reminds us of the correspondence between consensus and broadcasts:
the form of consensus relevant for blockchain is technically known as atomic broadcast
It also states that (back then):
The most important and most prominent way to implement atomic broadcast (i.e., consensus) in distributed systems prone to t < n/2 node crashes is the family of protocols known today as Paxos and Viewstamped Replication (VSR).
Decision
-
We realize that the way the off-chain protocol is specified in the paper, the
broadcastabstraction required from theNetworkinterface is a so-called uniform reliable broadcast. Hence, any implementation ofNetworkneeds to satisfy the following properties:- Validity: If a correct process p broadcasts a message m, then p eventually delivers m.
- No duplication: No message is delivered more than once.
- No creation: If a process delivers a message m with sender s, then m was previously broadcast by process s.
- Agreement: If a message m is delivered by some correct process, then m is eventually delivered by every correct process.
See also Module 3.3 in Introduction to Reliable and Secure Distributed Programming by Cachin et al, or Self-stabilizing Uniform Reliable Broadcast by Oskar Lundström
-
Use
etcdas a proxy to achieve reliable broadcast via its raft consensus- Raft is an evolution of Paxos and similar to VSR
- Over-satisfies requirements as it provides "Uniform total order" (satisfies atomic broadcast properties)
- Each
hydra-noderuns aetcdinstance to realize itsNetworkinterface - See the following architecture diagram which also contains some notes on
Networkinterface properties:

- We supersede ADR 17 and ADR 27 decisions on how to implement
Networkwith the current ADR.- Drop existing implementation of
OuroborosandReliabilitycomponents - Could be revisited, as in theory it would satisfy properties if implemented correctly?
- Uniform reliable broadcast = only deliver when seen by everyone = not what we had implemented?
- Drop existing implementation of
Consequences
-
Crash tolerance of up to
n/2failing nodes -
Using
etcdas-is adds a run-time dependency onto that binary.- Docker image users should not see any different UX
- We can ship the binary through
hydra-node.
-
Introspectability network as the
etcdcluster is queryable could improve debugging experience -
Persisted state for networking changes as there will be no
acks, but theetcdWrite Ahead Log (WAL) and a last seen revision. -
Can keep same user experience on configuration
- Full, static topology with listing everyone as
--peer - Simpler configuration via peer discovery possible
- Full, static topology with listing everyone as
-
PeerConnectedsemantics needs to change to an overallHydraNetworkConnected- We can only submit / receive messages when connected to the majority cluster
-
etcdhas a few features out-of-the-box we could lean into, e.g.- use TLS to secure peer connections
- separate advertised and binding addresses


