schedule fuzz based on preprocess queue size #5153
Conversation
I'm not going to do a thorough review but will share context to avoid a production incident.
```diff
-def count_unacked(creds, project_id, subscription_id):
-  """Counts the unacked messages in |subscription_id|."""
+def get_queue_size(creds, project_id, subscription_id):
```
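For context, a rough sketch of how a `get_queue_size` helper could read the backlog from the Cloud Monitoring metric `pubsub.googleapis.com/subscription/num_undelivered_messages` (the metric whose ~5 minute delay is raised in the comment below). This is an assumption about the implementation, not the PR's actual code:

```python
# Sketch only: assumes the queue size is read from the Cloud Monitoring
# metric pubsub.googleapis.com/subscription/num_undelivered_messages.
import time

from google.cloud import monitoring_v3


def get_queue_size(creds, project_id, subscription_id):
  """Returns the (delayed) number of undelivered messages in the subscription."""
  client = monitoring_v3.MetricServiceClient(credentials=creds)
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={'seconds': now},
      start_time={'seconds': now - 10 * 60},  # Look back 10 minutes.
  )
  results = client.list_time_series(
      request={
          'name': f'projects/{project_id}',
          'filter': (
              'metric.type='
              '"pubsub.googleapis.com/subscription/num_undelivered_messages" '
              f'AND resource.labels.subscription_id="{subscription_id}"'),
          'interval': interval,
          'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      })
  # Points are returned newest first; use the most recent data point.
  # Note the metric trails reality by roughly 5 minutes.
  latest = 0
  for series in results:
    if series.points:
      latest = series.points[0].value.int64_value
  return latest
```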
It's probably worth noting somewhere that the queue size metric is delayed by about 5 minutes.
Not sure I got your point.
Sorry, I mean that we should mention the need for delays. If the cron runs too often, it might check the queue size before the metric has actually updated to reflect the jobs it just added. Does this make sense?
That's a good point. But the idea is to tune this via the feature flags, balancing the target queue size against the frequency of the cron execution.
Not sure if I understood it correctly: the point is to not run schedule_fuzz too often, right?
I'd run this locally using butler scripts to make sure it behaves nicely.
It's been running in dev for 4 days already :)
It will not, because this should land only after #5150. And even though we don't have the job limiter for batch yet, we can control how many tasks are forwarded there via the RemoteTaskGate frequencies.
Great. I haven't had a chance to review/understand that PR, but are you relying on sending everything to k8s to avoid infinite queuing in batch? Otherwise I don't see how we would ever know if batch is full/queueing without querying the batch API. I trust you!
/gcbrun
This PR enhances the reliability of remote task scheduling by introducing queue size limits and handling uncreated tasks via lease cancellation. It refactors RemoteTaskGate, GcpBatchService, and KubernetesService to return tasks that fail scheduling, ensuring they are redelivered by Pub/Sub. If a task cannot be scheduled to the runtime, it returns to the utask-main scheduler queue. Also, the UTask execute function raises an exception if the utask-main schedule queue size is higher than a threshold; this exception prevents the preprocess task from being acked. Finally, the follow-up PR #5153 addresses the schedule-fuzz implementation to avoid scheduling more tasks if the preprocess queue size is also bigger than a threshold.

This entire logic was tested in the dev environment. The evidence follows.

When the Kubernetes job limiter was reached, the utask-main queue started to grow due to the unacked tasks. It stopped close to 10k messages, because the UTask execute function only runs preprocess if the utask-main queue is under 10k.

<img width="1084" height="771" alt="image" src="https://github.com/user-attachments/assets/85ccc0b0-ecd6-4120-9c1c-fe3f9d2b54a7" />

As the tworkers stopped running preprocess, the preprocess queue started to grow. Note that the preprocess queue size sticks close to 10k as well. This happens because schedule-fuzz only creates tasks based on the preprocess queue size, aiming to have 10k messages there. That part is being addressed in PR #5153.

<img width="1084" height="771" alt="image" src="https://github.com/user-attachments/assets/c18bde50-9a27-4d46-80c4-8ee7af57718a" />

---------

Signed-off-by: Javan Lacerda <javanlacerda@google.com>
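For illustration only, a minimal sketch of the back-pressure behavior described above: the preprocess path raises when the downstream queue is above a threshold, so the Pub/Sub message is never acked and gets redelivered later. All identifiers here (`get_queue_size`, `run_preprocess`, the subscription name, the threshold constant, and the exception type) are placeholders, not the PR's actual code:

```python
# Hypothetical sketch of the back-pressure described above; names are
# placeholders and do not match the real ClusterFuzz implementation.
UTASK_MAIN_QUEUE_THRESHOLD = 10000


class QueueSizeExceededError(Exception):
  """Raised so the preprocess message is not acked and is redelivered."""


def maybe_execute_preprocess(creds, project_id, task):
  # If the downstream utask-main queue is already saturated, refuse to run
  # preprocess. The exception propagates, the Pub/Sub message stays unacked,
  # and Pub/Sub redelivers it later.
  queue_size = get_queue_size(creds, project_id, 'utask-main-scheduler')
  if queue_size > UTASK_MAIN_QUEUE_THRESHOLD:
    raise QueueSizeExceededError(
        f'utask-main queue has {queue_size} messages, above '
        f'{UTASK_MAIN_QUEUE_THRESHOLD}; deferring preprocess.')
  run_preprocess(task)
```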
```diff
 # TODO(metzman): Actually implement this.
 CPUS_PER_FUZZ_JOB = 2
+PREPROCESS_TARGET_SIZE_DEFAULT = 10000
```
How did you reach this number? Shouldn't the default depend on the project config (external, internal, ...)?
Also, maybe we should define a rationale for this number based on the number of tworkers plus the batch/k8s quota.
It's an arbitrary default number; it will be controlled through the feature flag. And indeed we will tweak it per project based on the fleet.
I still think that, regardless of the feature flag, we should have a default defined in the project config (similar to what was done with the CPU count).
ViniciustCosta left a comment:
Approving this to unblock the architecture changes so we can monitor them in prod.
I'm still not sure how this will avoid overloading batch while batch does not have the job limiter implemented yet (scheduling based on the preprocess queue size seems to assume that limiter in order to work).
I've described the rationale here: #5150 (comment)
SHOULD BE MERGED AFTER #5150.
It updates the logic for scheduling fuzz tasks. Instead of aiming to load the GCP Batch infrastructure based on the available CPUs, it looks only at the preprocess queue size.
By default, it aims to keep the preprocess queue at 10k messages, creating tasks based on the difference between the target value and the current value.
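To make that scheduling rule concrete, here is a hedged sketch of the queue-difference computation described above; the helper name `get_queue_size`, the subscription name, and the way credentials are threaded through are assumptions for illustration:

```python
# Illustrative only: assumed names, not the PR's actual implementation.
PREPROCESS_TARGET_SIZE_DEFAULT = 10000


def fuzz_tasks_to_schedule(creds, project_id,
                           target_size=PREPROCESS_TARGET_SIZE_DEFAULT):
  """Returns how many fuzz tasks to create on this cron run."""
  current_size = get_queue_size(creds, project_id, 'preprocess')
  # Only schedule the gap between the target and what is already queued;
  # if the queue is at or above the target, schedule nothing.
  return max(0, target_size - current_size)
```

Note that, per the review discussion above, the metric backing the queue size lags by roughly five minutes, so the cron interval has to be long enough for previously scheduled tasks to show up in the reading before the next run.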