Linked to https://github.com/MetaMask/mesh-testing/pull/53
DO NOT MERGE YET - experimental branch to work out stability issues in libp2p.
For the last few days I've been troubleshooting some stability issues with libp2p. Below I describe what they are and what the possible fixes for them are:
## libp2p connection and stability issues
The current issues with connection management in libp2p prevent it from connecting to more than ~10 simultaneous peers. I've observed that once we reach that threshold things become very unstable: we can no longer send messages over pubsub reliably, and we can't keep those connections open for any significant period of time (connection drops). This is due to several issues:
- Physical (non-muxed) connections don't get reused correctly. Instead they get thrown away on each dial, which in certain situations can lead to connections piling up and eventually backing up the connection queue (connection backlog). The particular line where this _might_ be happening is https://github.com/libp2p/js-libp2p-switch/blob/master/src/dial.js#L184.
- Too many concurrent dials and a low connection timeout. I'm not entirely sure whether this is a real issue yet, but in some cases I've seen improvements from lowering the number of concurrent dials done by libp2p-switch (https://github.com/libp2p/js-libp2p-switch/blob/master/src/transport.js#L11) to ~2 and increasing the timeout to ~2 minutes (https://github.com/libp2p/js-libp2p-switch/blob/master/src/transport.js#L15). This is not conclusive and might be affected by other factors, but it's definitely something to keep an eye on.
- Wrong way of closing connections. We don't seem to close connections properly in libp2p when the connection manager signals a disconnect: we do close the muxed connection, but the physical connection is left dangling in some cases, leading to connection pile-up. This should not happen, since destroying the muxed connection should in theory destroy the physical connection, but I've seen a number of cases where it doesn't. In any case, I believe we should have an explicit `Connection.destroy` in the `interface-connection` code to ensure that the connection closes properly in all cases.
- No way of detecting stale connections. Libp2p supports a variety of transports, and not all of them have a way of detecting whether the connection has gone stale (the other side died or dropped). For this I believe we need a heartbeat mechanism that detects and disconnects stale/dangling connections and performs the steps needed to clean them up properly.
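To make the heartbeat idea in the last bullet more concrete, here is a minimal sketch of the bookkeeping such a mechanism might need. It is not libp2p code: `HeartbeatMonitor`, `track`, `touch`, `sweep` and the `timeoutMs` option are hypothetical names chosen for illustration, and the only assumption about a connection is that it exposes a `destroy()` method. Sending the actual ping/pong messages is left out.

```javascript
// Hypothetical sketch: none of these names come from libp2p.
class HeartbeatMonitor {
  // timeoutMs: how long a connection may stay silent before it counts as stale.
  // now: injectable clock, so the logic can be exercised without real timers.
  constructor ({ timeoutMs = 30000, now = () => Date.now() } = {}) {
    this.timeoutMs = timeoutMs
    this.now = now
    this.lastSeen = new Map() // connection id -> { conn, seenAt }
  }

  // Start tracking a connection; `conn` only needs a destroy() method (assumption).
  track (id, conn) {
    this.lastSeen.set(id, { conn, seenAt: this.now() })
  }

  // Call whenever a pong (or any other traffic) arrives on the connection.
  touch (id) {
    const entry = this.lastSeen.get(id)
    if (entry) entry.seenAt = this.now()
  }

  // Destroy and forget every connection that has been silent for too long.
  // Returns the ids that were swept.
  sweep () {
    const stale = []
    const cutoff = this.now() - this.timeoutMs
    for (const [id, entry] of this.lastSeen) {
      if (entry.seenAt < cutoff) {
        entry.conn.destroy()
        this.lastSeen.delete(id)
        stale.push(id)
      }
    }
    return stale
  }
}
```

In practice `sweep()` would probably run on a timer, for example `setInterval(() => monitor.sweep(), 10000)`, with `touch()` wired to whatever traffic the transport reports. The interval and timeout values here are placeholders, not measured recommendations.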
## Action items:
- Fix the dial flow so that connections don't get lost/untracked before they are properly cleaned up
- Fix all possible disconnect issues
- Add a heartbeat mechanism
- Experiment with `Connection.destroy()` in `interface-connection`
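As a rough illustration of the first and last action items, the sketch below tracks one physical connection per peer, reuses it on later dials, and tears down both layers explicitly. Everything here is an assumption made for the example: `ConnectionTracker`, `getOrDial`, the `{ raw, muxer }` shape and the `muxer.end()` / `raw.destroy()` methods are hypothetical and are not the `interface-connection` or libp2p-switch API. The dial function is synchronous only to keep the example short; a real dial is asynchronous.

```javascript
// Hypothetical sketch: ConnectionTracker and its methods are not libp2p APIs.
class ConnectionTracker {
  constructor () {
    this.byPeer = new Map() // peer id -> { raw, muxer }
  }

  // Reuse the tracked connection for a peer, or create and track a new one.
  // `dialFn` is a caller-supplied function returning { raw, muxer } (assumption).
  getOrDial (peerId, dialFn) {
    const existing = this.byPeer.get(peerId)
    if (existing) return existing
    const entry = dialFn(peerId)
    this.byPeer.set(peerId, entry)
    return entry
  }

  // Stop tracking the peer, end the muxer, then destroy the physical
  // connection. The `finally` ensures the physical connection is destroyed
  // even if ending the muxer throws (the error still propagates afterwards).
  // Returns false if the peer was not tracked.
  destroy (peerId) {
    const entry = this.byPeer.get(peerId)
    if (!entry) return false
    this.byPeer.delete(peerId)
    try {
      if (entry.muxer) entry.muxer.end()
    } finally {
      entry.raw.destroy()
    }
    return true
  }
}
```

The point of the design is that the physical connection never leaves the tracker without `destroy()` being called on it, which is the property the dial flow and disconnect fixes are after.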
One thing that puzzled me while troubleshooting was that our deployed mesh doesn't seem to run into these issues: we're able to maintain a stable mesh with well over 100 simultaneous peers, but once I tried adding a circuit-relay node to the mesh everything started falling apart. The reason is that we keep the number of concurrent connections in libp2p-wrtc-star at 4, due to wrtc-specific limitations in the browser. This effectively mitigates the issues listed above when using wrtc, but they show up as soon as another transport is used, in this case websocket, which is used to connect to the relay node.
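The kind of cap described above could be applied to any transport with a small dial queue. The following is only a sketch under assumptions: `DialQueue` and its methods are hypothetical names rather than anything in libp2p-switch or libp2p-wrtc-star, and the default of 4 simply mirrors the figure mentioned above. Each task receives a `done` callback and must call it when its dial finishes or fails.

```javascript
// Hypothetical sketch: DialQueue is not part of libp2p.
class DialQueue {
  constructor (maxConcurrent = 4) {
    this.maxConcurrent = maxConcurrent
    this.running = 0
    this.pending = []
  }

  // Queue a dial. `task` is a function that receives a `done` callback and
  // must call it once the dial has finished or failed.
  push (task) {
    this.pending.push(task)
    this._next()
  }

  // Start queued tasks until the concurrency cap is reached.
  _next () {
    while (this.running < this.maxConcurrent && this.pending.length > 0) {
      const task = this.pending.shift()
      this.running++
      let finished = false
      task(() => {
        if (finished) return // ignore repeated calls to done()
        finished = true
        this.running--
        this._next()
      })
    }
  }
}
```

A caller would wrap each dial as `queue.push((done) => dialPeer(peer, done))`, where `dialPeer` stands in for whatever dial function is in use. Whether a cap like this is the right fix, or just hides the underlying cleanup problems, is something the experiments above would need to confirm.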
Disclaimer: It's a bit difficult to pinpoint these issues when so many things are happening at the same time, so some of my observations/conclusions might be off, but I believe I've identified at least some of the issues. The action items above should increase stability considerably.