Status
Accepted
Context
The current Head cluster is very fragile as has been observed on several occasions: A single hiccup in the connectivity between nodes while a head is open and nodes are exchanging messages can very easily lead to the Head being stuck and require an emergency closing, possibly even manually.
We want Hydra to be Consistent in the presence of Network Partitions, under the fail-recovery model assumption, eg. processes may fail by stopping and later recovering. Our system lies in the CP space of the landscape mapped by the CAP theorem.
We have identified 3 main sources of failures in the fail-recovery model that can lead to a head being stuck:
- The network layer can drop messages from the moment a node
broadcast
s it, leading to some messages not being received at the other end - The sending node can crash in between the moment the state is changed (and persisted) and the moment a message is actually sent through the network (or even when it calls
broadcast
) - The receiving node can crash in between the moment the message has been received in the network layer, and it's processed (goes through the queue)
We agree that we'll want to address all those issues in order to provide a good user experience, as not addressing 2. and 3. can lead to hard to troubleshoot issues with heads. We have not experienced those issues yet as they would probably only crop up under heavy loads, or in the wild. But we also agree we want to tackle 1. first because it's where most of the risk lies. By providing a Reliable Broadcast layer, we will significantly reduce the risks and can then later on address the other points.
Therefore, the scope of this ADR is to address only point 1. above: Ensure broadcast messages are eventually received by all peers, given the sender does not stop before.
Discussion
We are currently using the ouroboros-framework and typed-protocols network stack as a mere transport layer.
- Being built on top of TCP, ouroboros multiplexer (Mux) provides the same reliability guarantees, plus the multiplexing capabilities of course
- It also takes care of reconnecting to peers when a failure is detected which relieves us from doing so, but any reconnection implies a reset of each peer's state machine which means we need to make sure any change to the state of pending/received messages is handled by the applicative layer
- Our FireForget protocol ignores connections/disconnections
- Ouroboros/typed-protocols provides enough machinery to implement a reliable broadcast protocol, for example by reusing existing
[KeepAlive](https://github.com/input-output-hk/ouroboros-network/tree/master/ouroboros-network-protocols/src/Ouroboros/Network/Protocol/KeepAlive)
protocol and building a more robust point-to-point protocol than what we have now - There is a minor limitation, namely that the subscription mechanism does not handle connections invidually, but as a set of equivalent point-to-point full duplex connections whose size (valency) needs to be maintained at a certain threshold, which means that unless backed in the protocol itself, protocol state-machine and applications are not aware of the identity of the remote peer
We have built our
Network
infrastructure over the concept of relatively independent layers, each implementing a similar interface with different kind of messages, tobroadcast
messages to all peers and be notified of incoming messages through acallback
.This pipes-like abstraction allows us to compose our network stack like:
withAuthentication (contramap Authentication tracer) signingKey otherParties $
withHeartbeat nodeId connectionMessages $
withOuroborosNetwork (contramap Network tracer) localhost peersThis has the nice property that we can basically swap the lower layers should we need to, for example to use UDP, or add other layers for example to address specific head instances in presence of multiple heads
Decision
- We implement our own message tracking and resending logic as a standalone
Network
layer - That layer consumes and produces
Authenticated msg
messages as it relies on identifying the source of messages - It uses a vector of monotonically increasing sequence numbers associated with each party (including itself) to track what are the last messages from each party and to ensure FIFO delivery of messages
- This vector is used to identify peers which are lagging behind, resend the missing messages, or to drop messages which have already been received
- The Heartbeat mechanism is relied upon to ensure dissemination of state even when the node is quiescent
- We do not implement a pull-based message communication mechanism as initially envisioned
- We do not persist messages either on the receiving or sending side at this time
Consequences
- We keep our existing
Network
interface hence all messages will be resent to all peers- This could be later optimized either by providing a smarter interface with a
send :: Peer -> msg -> m ()
unicast function, or by adding a layer with filtering capabilities, or both
- This could be later optimized either by providing a smarter interface with a
- We want to specify this protocol clearly in order to ease implementation in other languages, detailing the structure of messages and the semantics of retries and timeouts.
- We may consider relying on the vector clock in the future to ensure perfect ordering of messages on each peer and make impossible for legit transactions to be temporarily seen as invalid. This can happen in the current version and is handled through wait and TTL