⚠ This page is served via a proxy. Original site: https://github.com
This service does not collect credentials or authentication data.
Skip to content

Conversation

@xiam
Copy link
Contributor

@xiam xiam commented Jan 21, 2026

Fixed goroutine leaks during ethmonitor shutdown. This caused deadlocks under specific race conditions (e.g. if the receiver exited first)

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a proof-of-concept test to demonstrate a goroutine leak that occurs during monitor shutdown in the ethmonitor package. The test uses mocked dependencies to simulate error conditions that trigger the leak.

Changes:

  • Adds go.uber.org/mock v0.6.0 dependency for test mocking support
  • Generates a mock implementation of the RawInterface for testing
  • Implements a POC test that demonstrates goroutines not being properly cleaned up on shutdown

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
go.mod Adds go.uber.org/mock v0.6.0 as a test dependency
go.sum Adds checksums for the new mock dependency
ethmonitor/internal/mocks/mock_provider.go Auto-generated mock implementation of ethrpc.RawInterface using mockgen
ethmonitor/goroutine_leak_poc_test.go POC test demonstrating goroutine leak during shutdown with mocked error conditions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

provider.EXPECT().RawBlockByNumber(gomock.Any(), gomock.Any()).
Return(nil, fmt.Errorf("simulated network error")).AnyTimes()

initialGoroutines := runtime.NumGoroutine()
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goroutine count baseline is captured before the test's own goroutine (lines 47-49) is started. This can lead to false positives in the leak detection. The baseline should be captured after starting the monitor goroutine, or the test should account for the test goroutine itself. Consider moving this line after line 49, or adding a small delay to ensure all test infrastructure goroutines are stable.

Copilot uses AI. Check for mistakes.
Comment on lines 60 to 61
time.Sleep(500 * time.Millisecond)

Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test relies on hard-coded sleep durations which can cause flakiness in CI environments or under load. The 200ms sleep before cancellation (line 51) and the 500ms sleep after shutdown (line 60) may not be sufficient in all environments. Consider using synchronization primitives or runtime.GC() followed by a shorter sleep to ensure goroutines have exited, rather than relying solely on fixed time delays.

Suggested change
time.Sleep(500 * time.Millisecond)
// Wait for goroutines to exit, allowing for GC and load variability.
const (
maxWait = 5 * time.Second
pollDelay = 50 * time.Millisecond
)
deadline := time.Now().Add(maxWait)
for {
runtime.GC()
time.Sleep(pollDelay)
if runtime.NumGoroutine() <= initialGoroutines || time.Now().After(deadline) {
break
}
}

Copilot uses AI. Check for mistakes.
}

time.Sleep(500 * time.Millisecond)

Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test doesn't explicitly call runtime.GC() and give goroutines time to be garbage collected before checking for leaks. The Go runtime may not immediately clean up exited goroutines. Consider adding runtime.GC() and a small sleep (e.g., 100ms) before the leak check to ensure more accurate results.

Suggested change
// Force garbage collection and give it a moment to complete before checking for leaks.
runtime.GC()
time.Sleep(100 * time.Millisecond)

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

done <- monitor.Run(ctx)
}()

time.Sleep(200 * time.Millisecond)
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sleep duration of 200ms is arbitrary and may be insufficient on slower systems to reliably trigger the goroutine leak, or may be excessive on faster systems. Consider either increasing the duration to ensure the monitor enters its run loop, or using a more deterministic synchronization mechanism such as checking monitor.IsRunning() in a loop with a timeout.

Suggested change
time.Sleep(200 * time.Millisecond)
waitCtx, waitCancel := context.WithTimeout(context.Background(), time.Second)
defer waitCancel()
for {
if monitor.IsRunning() {
break
}
select {
case <-waitCtx.Done():
t.Fatalf("monitor did not start running within timeout: %v", waitCtx.Err())
case <-time.After(10 * time.Millisecond):
}
}

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not worth it in this context.

Comment on lines +61 to +62
time.Sleep(100 * time.Millisecond)

Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sleep duration of 100ms after GC may be insufficient for goroutines to fully clean up. Goroutine leak detection is inherently racy, and a more robust approach would be to poll runtime.NumGoroutine() in a loop with a reasonable timeout (e.g., 1 second) until the count returns to baseline or the timeout expires.

Suggested change
time.Sleep(100 * time.Millisecond)
// Poll for goroutine count to return to baseline with a reasonable timeout
deadline := time.Now().Add(1 * time.Second)
for time.Now().Before(deadline) {
if runtime.NumGoroutine() <= baseline {
break
}
time.Sleep(10 * time.Millisecond)
}

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xiam xiam marked this pull request as ready for review January 22, 2026 14:51
@xiam xiam requested a review from a team January 22, 2026 14:51
@xiam xiam marked this pull request as draft January 22, 2026 14:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +493 to +497
select {
case ch <- latestBlockNum:
case <-m.ctx.Done():
return
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second select statement at line 493-497 is redundant and creates an unnecessary race condition. After successfully receiving from the nextBlock channel (line 488-492), the code immediately tries to send on ch. However, if the context is cancelled between these two select statements, the goroutine will return without sending on ch, which could cause the receiver to miss the block number. Consider combining these into a single select that handles both the receive and the send atomically, or ensure that the value is sent before checking context cancellation again.

Suggested change
select {
case ch <- latestBlockNum:
case <-m.ctx.Done():
return
}
// once we've observed the next block, always forward latestBlockNum
// to ch before checking context cancellation again
ch <- latestBlockNum

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

@xiam xiam Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is accurate in the sense that the goroutine returns without sending on ch, but if we're shutting down that's acceptable. I don't think we should act on this comment, but I'd defer that decision to the reviewers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants