Mesh Testing, Feat/stability
Linked to https://github.com/MetaMask/mesh-testing/pull/53
DO NOT MERGE YET - experimental branch to work out stability issues in libp2p.
For the last few days I've been troubleshooting some stability issues with libp2p. Below I describe what those issues are and what the possible fixes for them might be:
## libp2p connection and stability issues
The current issues with connection management in libp2p prevent it from connecting to more than ~10 simultaneous peers. I've observed that once we reach that threshold things become very unstable: we can no longer send messages over pubsub reliably, nor keep connections open for any significant period of time (connection drops). This is due to several issues:
- Physical (non-muxed) connections don't get reused correctly. They instead get thrown away on each dial, which in certain situations can lead to connections piling up and eventually backing up the connection queue (connection backlog). The particular line where this _might_ be happening is https://github.com/libp2p/js-libp2p-switch/blob/master/src/dial.js#L184.
- Too many concurrent dials and a low connection timeout. I'm not entirely sure whether this is a real issue just yet, but in some cases I've seen improvements after lowering the number of concurrent dials done by libp2p-switch (https://github.com/libp2p/js-libp2p-switch/blob/master/src/transport.js#L11) to ~2 and increasing the timeout to ~2 minutes (https://github.com/libp2p/js-libp2p-switch/blob/master/src/transport.js#L15). This is not conclusive and might be affected by other factors, but it is definitely something to keep an eye on.
- Connections are not closed correctly. We don't seem to close connections properly in libp2p when the connection manager signals a disconnect: we do close the muxed connection, but in some cases the physical connection is left dangling, leading to connection pile-up. This should not happen, since destroying the muxed connection should in theory destroy the underlying connection, but I've seen a number of cases where it doesn't. In any case, I believe we should have an explicit `Connection.destroy` in the `interface-connection` code to ensure that the connection closes properly in all cases.
- No way of detecting stale connections. Libp2p supports a variety of transports, and not all of them can detect whether a connection has gone stale (the other side died or dropped). For this I believe we need a heartbeat mechanism that detects and disconnects stale/dangling connections and performs the steps needed to clean them up properly.
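
To make the reuse and explicit-teardown points above more concrete, here is a minimal sketch of a per-peer connection table. It is not code from js-libp2p-switch: `ConnectionTable`, `getOrDial`, `close`, and `fakeConn` are hypothetical names, and the `{ raw, muxer }` shape with a `destroy()` method on each is an assumption made purely for illustration.

```javascript
// Hypothetical sketch: none of these names exist in js-libp2p-switch.
class ConnectionTable {
  constructor () {
    this.byPeer = new Map() // peerId -> { raw, muxer }
  }

  // Return the tracked connection for a peer, dialing only when none exists.
  getOrDial (peerId, dial) {
    let entry = this.byPeer.get(peerId)
    if (!entry) {
      entry = dial(peerId) // assumed to return { raw, muxer }
      this.byPeer.set(peerId, entry)
    }
    return entry
  }

  // Tear down the muxer and then the raw connection explicitly, so the
  // physical connection is not left dangling if the muxer fails to close it.
  close (peerId) {
    const entry = this.byPeer.get(peerId)
    if (!entry) return false
    this.byPeer.delete(peerId)
    entry.muxer.destroy()
    entry.raw.destroy()
    return true
  }
}

// Stand-in connection object, used only for illustration.
function fakeConn () {
  return { destroyed: false, destroy () { this.destroyed = true } }
}
```

The point of the design is that every physical connection stays tracked from dial to teardown, so a second dial to the same peer cannot orphan the first one.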
## Action items:
- Fix the dial flow to ensure that connections don't get lost/untracked before making sure they are properly cleaned up
- Fix all possible disconnect issues
- Add a heartbeat mechanism
- Experiment with `Connection.destroy()` in `interface-connection`
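
As a rough illustration of the heartbeat item, the following sketch tracks when each peer was last heard from and flags the ones that have gone quiet. `HeartbeatMonitor`, `touch`, `sweep`, `timeoutMs`, and `onStale` are all hypothetical names, not part of libp2p; how pings are actually sent over a given transport is left out.

```javascript
// Hypothetical sketch: a transport-agnostic liveness tracker.
class HeartbeatMonitor {
  constructor ({ timeoutMs, onStale }) {
    this.timeoutMs = timeoutMs
    this.onStale = onStale // called once per stale peer, e.g. to destroy its connection
    this.lastSeen = new Map() // peerId -> timestamp (ms) of the last reply
  }

  // Record that a peer answered a ping (or sent any traffic) at time `now`.
  touch (peerId, now = Date.now()) {
    this.lastSeen.set(peerId, now)
  }

  // Stop tracking every peer that has been silent for longer than timeoutMs
  // and hand it to onStale for cleanup. Returns the list of stale peer ids.
  sweep (now = Date.now()) {
    const stale = []
    for (const [peerId, seen] of this.lastSeen) {
      if (now - seen > this.timeoutMs) stale.push(peerId)
    }
    for (const peerId of stale) {
      this.lastSeen.delete(peerId)
      this.onStale(peerId)
    }
    return stale
  }
}
```

In a real node, `sweep` would presumably be driven by a timer and `touch` by ping replies; passing `now` explicitly just keeps the sketch easy to reason about.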
One thing that puzzled me while troubleshooting this was that our deployed mesh doesn't seem to run into these issues: we're able to maintain a stable mesh with well over ~100 simultaneous peers. But once I tried adding a circuit-relay node to the mesh, everything would start falling apart. The reason is that we keep the number of concurrent connections in libp2p-wrtc-star at 4, due to wrtc-specific limitations in the browser. This effectively mitigates the issues listed above when using wrtc, but they show up as soon as another transport is used, in this case websocket, which is used to connect to the relay node.
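
The effect of that per-transport cap can be sketched as a small limiter that applies to every dial regardless of transport. This is an illustration under assumptions, not the libp2p-wrtc-star or libp2p-switch implementation: `DialLimiter` and its `push` method are hypothetical, and each task is assumed to call `done` when its dial succeeds, fails, or times out.

```javascript
// Hypothetical sketch: a transport-independent cap on in-flight dials.
class DialLimiter {
  constructor (maxConcurrent) {
    this.maxConcurrent = maxConcurrent
    this.running = 0
    this.pending = []
  }

  // Queue a dial. `task` receives a `done` callback that it should call
  // when the dial has finished, whatever the outcome.
  push (task) {
    this.pending.push(task)
    this._next()
  }

  // Start queued dials until the concurrency cap is reached.
  _next () {
    while (this.running < this.maxConcurrent && this.pending.length > 0) {
      const task = this.pending.shift()
      this.running++
      let finished = false
      task(() => {
        if (finished) return // ignore duplicate completions
        finished = true
        this.running--
        this._next()
      })
    }
  }
}
```

With a cap of 4, for example, a fifth dial would simply wait in `pending` until one of the first four completes, whichever transport it uses.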
Disclaimer: It's a bit difficult to pinpoint these issues when so many things are happening at the same time, so some of my observations/conclusions might be off, but I believe I've identified at least some of the issues. The action items above should increase stability considerably.