-
Notifications
You must be signed in to change notification settings - Fork 26
ethmonitor: add poc for goroutine leak during shutdown #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a proof-of-concept test to demonstrate a goroutine leak that occurs during monitor shutdown in the ethmonitor package. The test uses mocked dependencies to simulate error conditions that trigger the leak.
Changes:
- Adds go.uber.org/mock v0.6.0 dependency for test mocking support
- Generates a mock implementation of the RawInterface for testing
- Implements a POC test that demonstrates goroutines not being properly cleaned up on shutdown
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| go.mod | Adds go.uber.org/mock v0.6.0 as a test dependency |
| go.sum | Adds checksums for the new mock dependency |
| ethmonitor/internal/mocks/mock_provider.go | Auto-generated mock implementation of ethrpc.RawInterface using mockgen |
| ethmonitor/goroutine_leak_poc_test.go | POC test demonstrating goroutine leak during shutdown with mocked error conditions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| provider.EXPECT().RawBlockByNumber(gomock.Any(), gomock.Any()). | ||
| Return(nil, fmt.Errorf("simulated network error")).AnyTimes() | ||
|
|
||
| initialGoroutines := runtime.NumGoroutine() |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goroutine count baseline is captured before the test's own goroutine (lines 47-49) is started. This can lead to false positives in the leak detection. The baseline should be captured after starting the monitor goroutine, or the test should account for the test goroutine itself. Consider moving this line after line 49, or adding a small delay to ensure all test infrastructure goroutines are stable.
| time.Sleep(500 * time.Millisecond) | ||
|
|
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test relies on hard-coded sleep durations which can cause flakiness in CI environments or under load. The 200ms sleep before cancellation (line 51) and the 500ms sleep after shutdown (line 60) may not be sufficient in all environments. Consider using synchronization primitives or runtime.GC() followed by a shorter sleep to ensure goroutines have exited, rather than relying solely on fixed time delays.
| time.Sleep(500 * time.Millisecond) | |
| // Wait for goroutines to exit, allowing for GC and load variability. | |
| const ( | |
| maxWait = 5 * time.Second | |
| pollDelay = 50 * time.Millisecond | |
| ) | |
| deadline := time.Now().Add(maxWait) | |
| for { | |
| runtime.GC() | |
| time.Sleep(pollDelay) | |
| if runtime.NumGoroutine() <= initialGoroutines || time.Now().After(deadline) { | |
| break | |
| } | |
| } |
| } | ||
|
|
||
| time.Sleep(500 * time.Millisecond) | ||
|
|
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test doesn't explicitly call runtime.GC() and give goroutines time to be garbage collected before checking for leaks. The Go runtime may not immediately clean up exited goroutines. Consider adding runtime.GC() and a small sleep (e.g., 100ms) before the leak check to ensure more accurate results.
| // Force garbage collection and give it a moment to complete before checking for leaks. | |
| runtime.GC() | |
| time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| done <- monitor.Run(ctx) | ||
| }() | ||
|
|
||
| time.Sleep(200 * time.Millisecond) |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sleep duration of 200ms is arbitrary and may be insufficient on slower systems to reliably trigger the goroutine leak, or may be excessive on faster systems. Consider either increasing the duration to ensure the monitor enters its run loop, or using a more deterministic synchronization mechanism such as checking monitor.IsRunning() in a loop with a timeout.
| time.Sleep(200 * time.Millisecond) | |
| waitCtx, waitCancel := context.WithTimeout(context.Background(), time.Second) | |
| defer waitCancel() | |
| for { | |
| if monitor.IsRunning() { | |
| break | |
| } | |
| select { | |
| case <-waitCtx.Done(): | |
| t.Fatalf("monitor did not start running within timeout: %v", waitCtx.Err()) | |
| case <-time.After(10 * time.Millisecond): | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not worth it in this context.
| time.Sleep(100 * time.Millisecond) | ||
|
|
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sleep duration of 100ms after GC may be insufficient for goroutines to fully clean up. Goroutine leak detection is inherently racy, and a more robust approach would be to poll runtime.NumGoroutine() in a loop with a reasonable timeout (e.g., 1 second) until the count returns to baseline or the timeout expires.
| time.Sleep(100 * time.Millisecond) | |
| // Poll for goroutine count to return to baseline with a reasonable timeout | |
| deadline := time.Now().Add(1 * time.Second) | |
| for time.Now().Before(deadline) { | |
| if runtime.NumGoroutine() <= baseline { | |
| break | |
| } | |
| time.Sleep(10 * time.Millisecond) | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| select { | ||
| case ch <- latestBlockNum: | ||
| case <-m.ctx.Done(): | ||
| return | ||
| } |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second select statement at line 493-497 is redundant and creates an unnecessary race condition. After successfully receiving from the nextBlock channel (line 488-492), the code immediately tries to send on ch. However, if the context is cancelled between these two select statements, the goroutine will return without sending on ch, which could cause the receiver to miss the block number. Consider combining these into a single select that handles both the receive and the send atomically, or ensure that the value is sent before checking context cancellation again.
| select { | |
| case ch <- latestBlockNum: | |
| case <-m.ctx.Done(): | |
| return | |
| } | |
| // once we've observed the next block, always forward latestBlockNum | |
| // to ch before checking context cancellation again | |
| ch <- latestBlockNum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is accurate in the sense that the goroutine returns without sending on ch, but if we're shutting down that's acceptable. I don't think we should act on this comment, but I'd defer that decision to the reviewers.
Fixed goroutine leaks during ethmonitor shutdown. This caused deadlocks under specific race conditions (e.g. if the receiver exited first)