
Conversation

@javanlacerda
Collaborator

@javanlacerda javanlacerda commented Feb 1, 2026

SHOULD BE MERGED AFTER #5150.

It updates the logic for scheduling fuzz tasks. Instead of aiming to load the GCP Batch infrastructure based on the available CPUs, it looks only at the preprocess queue size.

By default, it aims to keep the preprocess queue at 10k messages, creating new tasks based on the difference between the target value and the current queue size.
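In outline, the scheduler creates only enough new fuzz tasks to close the gap between the target and the current queue size. A minimal sketch of that calculation, with an illustrative constant and function name rather than the PR's exact code:

```python
PREPROCESS_TARGET_SIZE = 10000  # Target number of messages to keep in the preprocess queue.


def num_fuzz_tasks_to_schedule(current_queue_size):
  """Creates only enough new tasks to refill the preprocess queue to the target."""
  return max(0, PREPROCESS_TARGET_SIZE - current_queue_size)
```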


@javanlacerda javanlacerda marked this pull request as ready for review February 4, 2026 16:51
@jonathanmetzman
Collaborator

I'm not going to do a thorough review but will share context to avoid a production incident.
I think it's likely there will be infinite queuing if #5140 is not landed before this.


def count_unacked(creds, project_id, subscription_id):
"""Counts the unacked messages in |subscription_id|."""
def get_queue_size(creds, project_id, subscription_id):
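For context, this is roughly how a get_queue_size() helper could read the subscription backlog from the Cloud Monitoring num_undelivered_messages metric. This is an assumed implementation, not the PR's code, and as the comment below notes, the metric lags the real queue state by about five minutes.

```python
import time

from google.cloud import monitoring_v3


def get_queue_size(creds, project_id, subscription_id):
  """Returns the most recent unacked-message count for |subscription_id|."""
  client = monitoring_v3.MetricServiceClient(credentials=creds)
  now = int(time.time())
  interval = monitoring_v3.TimeInterval({
      'start_time': {'seconds': now - 10 * 60},  # 10-minute lookback window.
      'end_time': {'seconds': now},
  })
  results = client.list_time_series(
      request={
          'name': f'projects/{project_id}',
          'filter': ('metric.type="pubsub.googleapis.com/subscription/'
                     'num_undelivered_messages" AND '
                     f'resource.labels.subscription_id="{subscription_id}"'),
          'interval': interval,
          'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      })
  for series in results:
    # Points are returned newest-first, so the first point is the latest sample.
    return series.points[0].value.int64_value
  return 0
```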
Collaborator

It's probably worth noting somewhere that the queue size metric is delayed by about 5 minutes.

Collaborator Author

Not sure I got your point.

Collaborator

Sorry, I mean that we should mention the need for delays. If the cron runs too often, it might check the queue size before the metric has actually updated to reflect the jobs it just added. Does this make sense?

Collaborator Author

That's a good point. But the idea is to tweak this via the feature flags, balancing the size of the queue against the frequency of the cron execution.

Contributor

Not sure if I understood it correctly; the point is to not run schedule_fuzz too often, right?

@jonathanmetzman
Collaborator

I'd run this locally using butler scripts to make sure it behaves nicely.

@javanlacerda
Collaborator Author

javanlacerda commented Feb 4, 2026

I'd run this locally using butler scripts to make sure it behaves nicely.

It's been running in dev for 4 days already :)

@javanlacerda
Collaborator Author

I'm not going to do a thorough review but will share context to avoid a production incident. I think it's likely there will be infinite queuing if #5140 is not landed before this.

It will not, because this should land only after #5150. And even though we don't have the job limiter for Batch yet, we can control how many tasks are forwarded there via the RemoteTaskGate frequencies.


@jonathanmetzman
Collaborator

jonathanmetzman commented Feb 4, 2026

I'm not going to do a thorough review but will share context to avoid a production incident. I think it's likely there will be infinite queuing if #5140 is not landed before this.

It will not, because this should land only after #5150. And even though we don't have the job limiter for Batch yet, we can control how many tasks are forwarded there via the RemoteTaskGate frequencies.

Great. I haven't had a chance to review/understand that PR, but are you relying on sending everything to k8s to avoid infinite queuing in Batch? Otherwise I don't see how we'd ever know whether Batch is full/queuing without querying the Batch API. I trust you!


Signed-off-by: Javan Lacerda <javanlacerda@google.com>

fixes

Signed-off-by: Javan Lacerda <javanlacerda@google.com>
@javanlacerda javanlacerda force-pushed the javan.schedule-fuzz-queue-size branch from c58e4bb to ffb5cc7 Compare February 4, 2026 20:47
@javanlacerda
Collaborator Author

/gcbrun

javanlacerda added a commit that referenced this pull request Feb 5, 2026
This PR enhances the reliability of remote task scheduling by
introducing queue size limits and handling uncreated tasks via lease
cancellation. It refactors RemoteTaskGate, GcpBatchService, and
KubernetesService to return tasks that fail scheduling, ensuring they
are redelivered by Pub/Sub.

If a task cannot be scheduled on the runtime, it returns to the utask-main
scheduler queue.

Also, the UTask execute function raises an exception if the utask-main
schedule queue size is higher than a threshold. This exception causes the
preprocess task not to be acked (a sketch of this pattern follows the commit
message below).

Finally, the follow-up PR
#5153 addresses the
schedule-fuzz implementation to avoid scheduling more tasks when the
preprocess queue size is also bigger than a threshold.

All of the logic described here was tested in the dev environment. See
the evidence below:

When the Kubernetes job limiter was reached, the utask-main queue
started to grow due to the unacked tasks. It stopped close to 10k
messages, because the UTask execute function only runs preprocess if
the utask-main queue is under 10k.
<img width="1084" height="771" alt="image" src="https://github.com/user-attachments/assets/85ccc0b0-ecd6-4120-9c1c-fe3f9d2b54a7" />


As the tworkers stopped running preprocess, the preprocess queue started to
grow. Note that the preprocess queue size also stuck close to 10k. This
happens because schedule-fuzz only creates tasks based on the preprocess
queue size, aiming to keep 10k messages there. This is being addressed in
PR #5153.
<img width="1084" height="771" alt="image" src="https://github.com/user-attachments/assets/c18bde50-9a27-4d46-80c4-8ee7af57718a" />

---------

Signed-off-by: Javan Lacerda <javanlacerda@google.com>
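The "raise so the message is not acked" step described in that commit message might look roughly like the sketch below; the threshold constant, exception name, and function name are illustrative assumptions, not the actual ClusterFuzz identifiers.

```python
MAX_UTASK_MAIN_QUEUE_SIZE = 10000


class QueueFullError(Exception):
  """Raised to leave the current preprocess message unacked."""


def maybe_reject_preprocess(utask_main_queue_size):
  """Aborts preprocess when the downstream queue is too full."""
  if utask_main_queue_size > MAX_UTASK_MAIN_QUEUE_SIZE:
    # The exception propagates out of the task handler, the message is never
    # acked, and Pub/Sub redelivers it once the utask-main queue has drained.
    raise QueueFullError('utask-main queue is above the threshold.')
```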

# TODO(metzman): Actually implement this.
CPUS_PER_FUZZ_JOB = 2
PREPROCESS_TARGET_SIZE_DEFAULT = 10000
Collaborator

How did you reach this number? Shouldn't the default depend on the project config (external, internal, ...)?

Also, maybe we should define a rationale for this number based on the number of tworkers plus the batch/k8s quota.

Collaborator Author

It's an arbitrary default; it will be controlled through the feature flag. And indeed we will tweak it for each project based on the fleet.

Collaborator

I still think that, regardless of the feature flag, we should have a default defined in the project config (similar to what was done with the cpu count).
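A sketch of what a project-config-backed default could look like, assuming a dict-like project_config and an illustrative key name; this is a suggestion in line with the comment above, not code from the PR.

```python
PREPROCESS_TARGET_SIZE_DEFAULT = 10000


def get_preprocess_target_size(project_config):
  """Returns the per-project target size, falling back to the global default."""
  return project_config.get('schedule_fuzz.preprocess_target_size',
                            PREPROCESS_TARGET_SIZE_DEFAULT)
```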

Collaborator

@ViniciustCosta ViniciustCosta left a comment

Approving this to unblock the architecture changes so we can monitor them in prod.

I'm still not sure how this will not overload Batch if Batch does not have the job limiter implemented yet (and scheduling based on the preprocess queue size seems to assume it in order to work).

@javanlacerda
Collaborator Author

Approving this to unblock the architecture changes so we can monitor them in prod.

I'm still not sure how this will not overload Batch if Batch does not have the job limiter implemented yet (and scheduling based on the preprocess queue size seems to assume it in order to work).

I've described the rationale here #5150 (comment)

@javanlacerda javanlacerda merged commit f96a469 into master Feb 5, 2026
11 checks passed
@javanlacerda javanlacerda deleted the javan.schedule-fuzz-queue-size branch February 5, 2026 16:52