ethmonitor: add poc for goroutine leak during shutdown #197

xiam · 2026-01-21T18:47:27Z

Fixed goroutine leaks during ethmonitor shutdown. This caused deadlocks under specific race conditions (e.g. if the receiver exited first)

Copilot

Pull request overview

This PR adds a proof-of-concept test to demonstrate a goroutine leak that occurs during monitor shutdown in the ethmonitor package. The test uses mocked dependencies to simulate error conditions that trigger the leak.

Changes:

Adds go.uber.org/mock v0.6.0 dependency for test mocking support
Generates a mock implementation of the RawInterface for testing
Implements a POC test that demonstrates goroutines not being properly cleaned up on shutdown

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File	Description
go.mod	Adds go.uber.org/mock v0.6.0 as a test dependency
go.sum	Adds checksums for the new mock dependency
ethmonitor/internal/mocks/mock_provider.go	Auto-generated mock implementation of ethrpc.RawInterface using mockgen
ethmonitor/goroutine_leak_poc_test.go	POC test demonstrating goroutine leak during shutdown with mocked error conditions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-21T18:51:51Z

ethmonitor/goroutine_leak_poc_test.go

+	provider.EXPECT().RawBlockByNumber(gomock.Any(), gomock.Any()).
+		Return(nil, fmt.Errorf("simulated network error")).AnyTimes()
+
+	initialGoroutines := runtime.NumGoroutine()


The goroutine count baseline is captured before the test's own goroutine (lines 47-49) is started. This can lead to false positives in the leak detection. The baseline should be captured after starting the monitor goroutine, or the test should account for the test goroutine itself. Consider moving this line after line 49, or adding a small delay to ensure all test infrastructure goroutines are stable.

Copilot · 2026-01-21T18:51:51Z

ethmonitor/goroutine_leak_poc_test.go

+	time.Sleep(500 * time.Millisecond)
+


The test relies on hard-coded sleep durations which can cause flakiness in CI environments or under load. The 200ms sleep before cancellation (line 51) and the 500ms sleep after shutdown (line 60) may not be sufficient in all environments. Consider using synchronization primitives or runtime.GC() followed by a shorter sleep to ensure goroutines have exited, rather than relying solely on fixed time delays.

Suggested change

time.Sleep(500 * time.Millisecond)

// Wait for goroutines to exit, allowing for GC and load variability.

const (

maxWait = 5 * time.Second

pollDelay = 50 * time.Millisecond

)

deadline := time.Now().Add(maxWait)

for {

runtime.GC()

time.Sleep(pollDelay)

if runtime.NumGoroutine() <= initialGoroutines || time.Now().After(deadline) {

break

}

}

Copilot · 2026-01-21T18:51:51Z

ethmonitor/goroutine_leak_poc_test.go

+	}
+
+	time.Sleep(500 * time.Millisecond)
+


The test doesn't explicitly call runtime.GC() and give goroutines time to be garbage collected before checking for leaks. The Go runtime may not immediately clean up exited goroutines. Consider adding runtime.GC() and a small sleep (e.g., 100ms) before the leak check to ensure more accurate results.

Suggested change

// Force garbage collection and give it a moment to complete before checking for leaks.

runtime.GC()

time.Sleep(100 * time.Millisecond)

Copilot

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-21T23:32:11Z

ethmonitor/goroutine_leak_poc_test.go

+		done <- monitor.Run(ctx)
+	}()
+
+	time.Sleep(200 * time.Millisecond)


The sleep duration of 200ms is arbitrary and may be insufficient on slower systems to reliably trigger the goroutine leak, or may be excessive on faster systems. Consider either increasing the duration to ensure the monitor enters its run loop, or using a more deterministic synchronization mechanism such as checking monitor.IsRunning() in a loop with a timeout.

Suggested change

time.Sleep(200 * time.Millisecond)

waitCtx, waitCancel := context.WithTimeout(context.Background(), time.Second)

defer waitCancel()

for {

if monitor.IsRunning() {

break

}

select {

case <-waitCtx.Done():

t.Fatalf("monitor did not start running within timeout: %v", waitCtx.Err())

case <-time.After(10 * time.Millisecond):

}

}

Not worth it in this context.

Copilot · 2026-01-21T23:32:11Z

ethmonitor/goroutine_leak_poc_test.go

+	time.Sleep(100 * time.Millisecond)
+


The sleep duration of 100ms after GC may be insufficient for goroutines to fully clean up. Goroutine leak detection is inherently racy, and a more robust approach would be to poll runtime.NumGoroutine() in a loop with a reasonable timeout (e.g., 1 second) until the count returns to baseline or the timeout expires.

Suggested change

time.Sleep(100 * time.Millisecond)

// Poll for goroutine count to return to baseline with a reasonable timeout

deadline := time.Now().Add(1 * time.Second)

for time.Now().Before(deadline) {

if runtime.NumGoroutine() <= baseline {

break

}

time.Sleep(10 * time.Millisecond)

}

Copilot

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ethmonitor/goroutine_leak_poc_test.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-22T15:36:54Z

ethmonitor/ethmonitor.go

+				select {
+				case ch <- latestBlockNum:
+				case <-m.ctx.Done():
+					return
+				}


The second select statement at line 493-497 is redundant and creates an unnecessary race condition. After successfully receiving from the nextBlock channel (line 488-492), the code immediately tries to send on ch. However, if the context is cancelled between these two select statements, the goroutine will return without sending on ch, which could cause the receiver to miss the block number. Consider combining these into a single select that handles both the receive and the send atomically, or ensure that the value is sent before checking context cancellation again.

Suggested change

select {

case ch <- latestBlockNum:

case <-m.ctx.Done():

return

}

// once we've observed the next block, always forward latestBlockNum

// to ch before checking context cancellation again

ch <- latestBlockNum

I think the comment is accurate in the sense that the goroutine returns without sending on ch, but if we're shutting down that's acceptable. I don't think we should act on this comment, but I'd defer that decision to the reviewers.

ethmonitor: add poc for goroutine leak during shutdown

b8be9dc

xiam requested a review from Copilot January 21, 2026 18:48

Copilot started reviewing on behalf of xiam January 21, 2026 18:48 View session

Copilot AI reviewed Jan 21, 2026

View reviewed changes

ethmonitor: fixes after code review

8734b3b

xiam requested a review from Copilot January 21, 2026 23:29

Copilot started reviewing on behalf of xiam January 21, 2026 23:29 View session

Copilot AI reviewed Jan 21, 2026

View reviewed changes

ethmonitor: fix goroutine leak

218cfec

xiam requested a review from Copilot January 22, 2026 00:07

Copilot started reviewing on behalf of xiam January 22, 2026 00:07 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

ethmonitor/goroutine_leak_poc_test.go Outdated Show resolved Hide resolved

rename PoC test to TestMonitorShutdownNoGoroutineLeak

9f78a00

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

xiam marked this pull request as ready for review January 22, 2026 14:51

xiam requested a review from a team January 22, 2026 14:51

xiam marked this pull request as draft January 22, 2026 14:52

ethmonitor: fix goroutine leak

2003fb6

xiam requested a review from Copilot January 22, 2026 15:03

Copilot started reviewing on behalf of xiam January 22, 2026 15:03 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

xiam requested a review from Copilot January 22, 2026 15:33

Copilot started reviewing on behalf of xiam January 22, 2026 15:33 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

klaidliadon approved these changes Jan 22, 2026

View reviewed changes

-	time.Sleep(500 * time.Millisecond)
+	// Wait for goroutines to exit, allowing for GC and load variability.
+	const (
+		maxWait   = 5 * time.Second
+		pollDelay = 50 * time.Millisecond
+	)
+	deadline := time.Now().Add(maxWait)
+	for {
+		runtime.GC()
+		time.Sleep(pollDelay)
+		if runtime.NumGoroutine() <= initialGoroutines || time.Now().After(deadline) {
+			break
+		}
+	}

-	time.Sleep(200 * time.Millisecond)
+	waitCtx, waitCancel := context.WithTimeout(context.Background(), time.Second)
+	defer waitCancel()
+	for {
+		if monitor.IsRunning() {
+			break
+		}
+		select {
+		case <-waitCtx.Done():
+			t.Fatalf("monitor did not start running within timeout: %v", waitCtx.Err())
+		case <-time.After(10 * time.Millisecond):
+		}
+	}

-	time.Sleep(100 * time.Millisecond)
+	// Poll for goroutine count to return to baseline with a reasonable timeout
+	deadline := time.Now().Add(1 * time.Second)
+	for time.Now().Before(deadline) {
+		if runtime.NumGoroutine() <= baseline {
+			break
+		}
+		time.Sleep(10 * time.Millisecond)
+	}

ethmonitor: add poc for goroutine leak during shutdown #197

Are you sure you want to change the base?

ethmonitor: add poc for goroutine leak during shutdown #197

Uh oh!

Conversation

xiam commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

xiam Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

xiam Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xiam commented Jan 21, 2026 •

edited

Loading

xiam Jan 22, 2026 •

edited

Loading