
3 posts tagged with "Superseded"


17. Use UDP protocol for Hydra networking

2 min read

Status

Superseded (as never implemented) by ADR 32

Context

The current Hydra networking layer is based on the Ouroboros network framework's networking stack which, among other features, provides:

  1. An abstraction of stream-based duplex communication channels called a Snocket,
  2. A Multiplexing connection manager that manages a set of equivalent peers, maintains connectivity, and ensures diffusion of messages to/from all peers,
  3. Typed protocols for expressing the logic of message exchanges as a form of state machine.

While it has been working mostly fine so far, the abstractions and facilities provided by this network layer are not well suited to Hydra Head networking. Some of the questions and shortcomings are discussed in a document on Networking Requirements, and as the Hydra Head matures, the time seems ripe for overhauling the current network implementation to better suit current and future Hydra Head networking needs.

Decision

  • Hydra Head nodes communicate by sending messages to other nodes using the UDP protocol

Details

  • How do nodes know each other? This is left unspecified by this ADR and deferred to future work; it is assumed that a Hydra node operator knows the IP:Port addresses of its peers before opening a Head with them
  • Are messages encrypted? This should probably be the case in order to ensure Heads' privacy, but it is also left for future work
  • How are nodes identified? At the moment they are identified by their IP:Port pair. As we implement more of the setup process from section 4 of the Hydra Head paper, we should identify nodes by some public key (hash) and resolve the actual IP:Port pair using some other mechanism

Consequences

  • The node's HeadLogic handles lost, duplicated, and out-of-order messages using retry and timeout mechanisms
  • Messages should carry a unique identifier, e.g. the source node and an index (a sketch follows this list)
  • The protocol, e.g. the message format, is documented
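
To illustrate the last two points, a minimal, hypothetical sketch of such a message envelope is given below; the types are illustrative stand-ins, not part of the hydra-node codebase:

    -- Hypothetical envelope for messages sent over UDP: the (source, index)
    -- pair uniquely identifies a message and lets the receiver detect
    -- duplicates and out-of-order delivery.
    newtype NodeId = NodeId String -- today an IP:Port pair, later possibly a public key (hash)
      deriving (Eq, Ord, Show)

    data Envelope msg = Envelope
      { source  :: NodeId -- ^ sending node
      , index   :: Word   -- ^ monotonically increasing, per-sender index
      , payload :: msg    -- ^ the actual Hydra Head protocol message
      }
      deriving (Eq, Show)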

18. Single state in Hydra.Node.

4 min read

Status

Superseded by ADR 23 and ADR 26

Context

  • Currently, the hydra-node maintains two pieces of state during the life-cycle of a Hydra Head:
    1. A HeadState tx provided by the HydraHead tx m handle interface and part of the Hydra.Node module. It provides the basis for the main hydra-node business logic in Hydra.Node.processNextEvent and Hydra.HeadLogic.update (Creation, Usage)
    2. SomeOnChainHeadState is kept in Hydra.Chain.Direct to keep track of the latest known head state, including notable transaction outputs and information on how to spend them (e.g. scripts and datums) (Code, Usage 1, Usage 2, Usage 3). (There are other unrelated things kept in memory, like the event history in the API server or a peer map in the network heartbeat component.)
  • The interface between the Hydra.Node and a Hydra.Chain component consists of
    • constructing certain Head protocol transactions given a description of them (PostChainTx tx):
      postTx :: MonadThrow m => PostChainTx tx -> m ()
    • a callback function invoked when the Hydra.Chain component observes a new Head protocol transaction, described by OnChainTx tx:
      type ChainCallback tx m = OnChainTx tx -> m ()
  • Given the usage sites above, the Hydra.Chain.Direct module requires additional information both to construct protocol transactions with postTx and to observe potential OnChainTx. Hence, the operation of the Hydra.Chain.Direct component (and likely of any component implementing the interface fully) is inherently stateful.
  • We are looking at upcoming features to handle rollbacks and to persist the head state.
    • Both could benefit from the idea that the HeadState is just the result of pure Event processing (a.k.a. event sourcing); a conceptual sketch follows this list
    • Right now, the HeadState kept in Hydra.Node alone is not enough to fully describe the state of the hydra-node. Hence it would not be enough to simply persist all Events and replay them to achieve persistence, nor to reset to some previous HeadState in the presence of a rollback.
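
To make the event-sourcing idea concrete, here is a small conceptual sketch; the types and the update function are placeholders rather than the actual Hydra.HeadLogic definitions:

    import Data.List (foldl')

    data Event = SomeEvent                               -- placeholder event type
    newtype HeadState = HeadState { eventsSeen :: Int }  -- placeholder state

    -- | Pure transition function, in the spirit of Hydra.HeadLogic.update.
    update :: HeadState -> Event -> HeadState
    update st SomeEvent = st { eventsSeen = eventsSeen st + 1 }

    -- | If the state is purely a fold over events, persistence amounts to
    -- storing the event log and replaying it; a rollback amounts to replaying
    -- a prefix of it.
    replay :: HeadState -> [Event] -> HeadState
    replay = foldl' update

The point above is precisely that today's HeadState does not have this property, since part of the relevant state lives in the chain layer.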

Decision

  • We define and keep a "blackbox" ChainStateType tx in the HeadState tx
    • It shall not be introspectable to the business logic in HeadLogic
    • It shall contain chain-specific information about the current Hydra Head, which will naturally need to evolve once we have multiple Heads in our feature scope
    • For example:
    data HeadState tx
      = IdleState
      | InitialState
          { chainState :: ChainStateType tx
          -- ...
          }
      | OpenState
          { chainState :: ChainStateType tx
          -- ...
          }
      | ClosedState
          { chainState :: ChainStateType tx
          -- ...
          }
  • We provide the latest ChainStateType tx to postTx:
    postTx :: ChainStateType tx -> PostChainTx tx -> m ()
  • We change the ChainEvent tx data type and callback interface of Chain to:
    data ChainEvent tx
      = Observation
          { observedTx :: OnChainTx tx
          , newChainState :: ChainStateType tx
          }
      | Rollback ChainSlot
      | Tick UTCTime

    type ChainCallback tx m = (ChainStateType tx -> Maybe (ChainEvent tx)) -> m ()

with the meaning that an invocation of the callback indicates receipt of a transaction which, given the current chain state, Maybe yields a relevant ChainEvent tx, where an Observation may include a newChainState.
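
The following is a hedged sketch of how this continuation-style callback could be wired on the node side; the types are simplified stand-ins for the hydra-node ones and the TVar-based state handling is only illustrative:

    import Control.Concurrent.STM (TVar, atomically, readTVar, writeTVar)

    data ChainState        -- stand-in for ChainStateType tx
    data OnChainTx tx      -- as in the Hydra.Chain interface
    data ChainSlot
    data UTCTime           -- stand-in, to avoid a dependency on the time package

    data ChainEvent tx
      = Observation (OnChainTx tx) ChainState  -- observedTx, newChainState
      | Rollback ChainSlot
      | Tick UTCTime

    type ChainCallback tx m = (ChainState -> Maybe (ChainEvent tx)) -> m ()

    -- | Node-side callback: apply the chain layer's continuation to the latest
    -- known chain state; if it observes something, remember the new chain
    -- state and hand the event over to further processing.
    nodeCallback :: TVar ChainState -> (ChainEvent tx -> IO ()) -> ChainCallback tx IO
    nodeCallback chainStateVar processEvent observe = do
      mEvent <- atomically $ do
        cs <- readTVar chainStateVar
        case observe cs of
          Just ev@(Observation _ newCs) -> writeTVar chainStateVar newCs >> pure (Just ev)
          other                         -> pure other
      mapM_ processEvent mEvent

Compared to the previous ChainCallback tx m = OnChainTx tx -> m (), the node now decides against which chain state an incoming transaction is interpreted, which is what allows dropping the TVar of OnChainHeadState inside Hydra.Chain.Direct.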

  • We also decide to extend OnChainEffect with a ChainState tx to explicitly thread the used chainState in the Hydra.HeadLogic.

Consequences

  • We need to change the construction of Chain handles and the call sites of postTx
  • We need to extract the state handling (similar to the event queue) out of the HydraNode handle and shuffle the main of hydra-node a bit to be able to provide the latest ChainState to the chain callback as a continuation.
  • We need to make the ChainState serializable (ToJSON, FromJSON) as it will be part of the HeadState (a sketch follows this list).
  • We can drop the TVar of keeping OnChainHeadState in the Hydra.Chain.Direct module.
  • We need to update DirectChainSpec and BehaviorSpec test suites to mock/implement the callback & state handling.
  • We might be able to simplify the ChainState tx to be just a UTxOType tx later.
  • As OnChainEffect and Observation values will contain a ChainStateType tx, traces will automatically include the full ChainState, which might be helpful but also possibly big.
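
Regarding serializability, a minimal sketch of what the JSON instances could look like, assuming aeson's generic deriving; the field shown is illustrative and not the actual ChainStateType definition:

    {-# LANGUAGE DeriveGeneric #-}

    import Data.Aeson (FromJSON, ToJSON)
    import GHC.Generics (Generic)

    -- Illustrative chain state; the real ChainStateType tx carries
    -- chain-specific information (scripts, datums, known outputs, ...).
    data ChainState = ChainState
      { chainSlot :: Word  -- ^ e.g. the slot of the last observed transaction
      }
      deriving (Eq, Show, Generic)

    instance ToJSON ChainState
    instance FromJSON ChainState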

Alternative

  • We could extend PostChainTx (like Observation) with ChainState and keep the signatures:
    postTx :: MonadThrow m => PostChainTx tx -> m ()

    type ChainCallback tx m = (ChainState tx -> Maybe (ChainEvent tx)) -> m ()
  • Not chosen, as it makes the need for a ChainState in the signatures less clear.

27. Network failures model

5 min read
Arnaud Bailly
Lead Architect
Pascal Grange
Senior Software Engineer

Status

Superseded by ADR 32

Context

The current Head cluster is very fragile, as has been observed on several occasions: a single hiccup in the connectivity between nodes while a head is open and nodes are exchanging messages can very easily lead to the Head becoming stuck and requiring an emergency closure, possibly even a manual one.

We want Hydra to be Consistent in the presence of Network Partitions, under the fail-recovery model assumption, i.e. processes may fail by stopping and later recovering. Our system lies in the CP space of the landscape mapped by the CAP theorem.

We have identified 3 main sources of failures in the fail-recovery model that can lead to a head being stuck:

  1. The network layer can drop messages from the moment a node broadcasts them, leading to some messages not being received at the other end
  2. The sending node can crash between the moment the state is changed (and persisted) and the moment a message is actually sent over the network (or even when it calls broadcast)
  3. The receiving node can crash between the moment a message has been received by the network layer and the moment it is processed (goes through the queue)

We agree that we'll want to address all of those issues in order to provide a good user experience, as not addressing 2. and 3. can lead to hard-to-troubleshoot issues with heads. We have not experienced those issues yet, as they would probably only crop up under heavy load or in the wild, but we also agree we want to tackle 1. first because that is where most of the risk lies. By providing a Reliable Broadcast layer, we will significantly reduce the risks and can then address the other points later on.

Therefore, the scope of this ADR is to address only point 1. above: Ensure broadcast messages are eventually received by all peers, given the sender does not stop before.

Discussion

  • We are currently using the ouroboros-framework and typed-protocols network stack as a mere transport layer.
    • Being built on top of TCP, ouroboros multiplexer (Mux) provides the same reliability guarantees, plus the multiplexing capabilities of course
    • It also takes care of reconnecting to peers when a failure is detected, which relieves us from doing so, but any reconnection implies a reset of each peer's state machine, which means we need to make sure any change to the state of pending/received messages is handled by the application layer
    • Our FireForget protocol ignores connections/disconnections
    • Ouroboros/typed-protocols provides enough machinery to implement a reliable broadcast protocol, for example by reusing existing [KeepAlive](https://github.com/input-output-hk/ouroboros-network/tree/master/ouroboros-network-protocols/src/Ouroboros/Network/Protocol/KeepAlive) protocol and building a more robust point-to-point protocol than what we have now
    • There is a minor limitation, namely that the subscription mechanism does not handle connections individually, but as a set of equivalent point-to-point full-duplex connections whose size (valency) needs to be maintained at a certain threshold, which means that, unless it is baked into the protocol itself, the protocol state machine and applications are not aware of the identity of the remote peer
  • We have built our Network infrastructure over the concept of relatively independent layers, each implementing a similar interface with different kinds of messages, to broadcast messages to all peers and be notified of incoming messages through a callback.
    • This pipes-like abstraction allows us to compose our network stack like:

    withAuthentication (contramap Authentication tracer) signingKey otherParties $
      withHeartbeat nodeId connectionMessages $
        withOuroborosNetwork (contramap Network tracer) localhost peers
    • This has the nice property that we can basically swap out the lower layers should we need to, for example to use UDP, or add other layers, for example to address specific head instances in the presence of multiple heads

Decision

  • We implement our own message tracking and resending logic as a standalone Network layer
  • That layer consumes and produces Authenticated msg messages as it relies on identifying the source of messages
  • It uses a vector of monotonically increasing sequence numbers, one per party (including itself), to track the latest message from each party and to ensure FIFO delivery of messages
    • This vector is used to identify peers that are lagging behind, to resend the missing messages, and to drop messages that have already been received (a sketch of this bookkeeping follows this list)
    • The Heartbeat mechanism is relied upon to ensure dissemination of state even when the node is quiescent
  • We do not implement a pull-based message communication mechanism as initially envisioned
  • We do not persist messages either on the receiving or sending side at this time
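
A hedged sketch of the per-party bookkeeping described above (not the actual hydra-node implementation) could look like this:

    import Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map

    newtype Party = Party String deriving (Eq, Ord, Show)

    -- | Highest message index seen so far from each party (including ourselves).
    type SeenVector = Map Party Word

    data Delivery msg
      = Deliver msg      -- ^ next expected message: hand it to the upper layer
      | Duplicate        -- ^ already seen: drop it
      | MissingFrom Word -- ^ gap detected: ask the sender to resend from this index
      deriving (Show)

    -- | Decide what to do with an incoming message, given its sender and index;
    -- indices start at 1 so that FIFO delivery per party is preserved.
    receive :: Party -> Word -> msg -> SeenVector -> (Delivery msg, SeenVector)
    receive sender ix msg seen
      | ix == expected = (Deliver msg, Map.insert sender ix seen)
      | ix <= lastSeen = (Duplicate, seen)
      | otherwise      = (MissingFrom expected, seen)
     where
      lastSeen = Map.findWithDefault 0 sender seen
      expected = lastSeen + 1

Periodically exchanging this information, as the Heartbeat mechanism does even when a node is quiescent, is what lets a node discover that a peer is lagging and trigger a resend.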

Consequences

  • We keep our existing Network interface, hence all messages will be resent to all peers
    • This could later be optimized either by providing a smarter interface with a send :: Peer -> msg -> m () unicast function, by adding a layer with filtering capabilities, or both (a sketch follows this list)
  • We want to specify this protocol clearly in order to ease implementation in other languages, detailing the structure of messages and the semantics of retries and timeouts.
  • We may consider relying on the vector clock in the future to ensure perfect ordering of messages on each peer and to make it impossible for legitimate transactions to be temporarily seen as invalid. This can happen in the current version and is handled through wait and TTL.
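
As an illustration of the first consequence, here is a sketch of what a hypothetical extended interface could look like; the types are simplified stand-ins for the hydra-node Network handle:

    data Peer = Peer String

    -- Simplified handle: broadcast is the existing operation (every message
    -- goes to all peers); send is the possible unicast addition for targeted
    -- resends.
    data Network m msg = Network
      { broadcast :: msg -> m ()
      , send      :: Peer -> msg -> m ()
      }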