@stefnestor
Contributor

Summary

Follow-up to #4592. When I originally wrote hot spotting (elastic/elasticsearch#95429), we didn't have task queue backlog, so some of its kind of content ended up over there.

Undoes that and adds in content related to

Generative AI disclosure

  1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
  • Yes
  • [X] No

@github-actions
Contributor

github-actions bot commented Jan 15, 2026

Vale Linting Results

Summary: 2 warnings, 6 suggestions found

⚠️ Warnings (2)
| File | Line | Rule | Message |
| --- | --- | --- | --- |
| troubleshoot/elasticsearch/hotspotting.md | 30 | Elastic.Latinisms | Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'. |
| troubleshoot/elasticsearch/task-queue-backlog.md | 128 | Elastic.BritishSpellings | Use American English spelling 'behavior' instead of British English 'behaviour'. |
💡 Suggestions (6)
| File | Line | Rule | Message |
| --- | --- | --- | --- |
| troubleshoot/elasticsearch/hotspotting.md | 30 | Elastic.WordChoice | Consider using 'can, might' instead of 'may', unless the term is in the UI. |
| troubleshoot/elasticsearch/hotspotting.md | 185 | Elastic.FutureTense | 'will most' might be in future tense. Write in the present tense to describe the state of the product as it is now. |
| troubleshoot/elasticsearch/hotspotting.md | 185 | Elastic.FutureTense | 'will surface' might be in future tense. Write in the present tense to describe the state of the product as it is now. |
| troubleshoot/elasticsearch/task-queue-backlog.md | 15 | Elastic.Wordiness | Consider using 'many' instead of 'a large number of'. |
| troubleshoot/elasticsearch/task-queue-backlog.md | 97 | Elastic.FutureTense | 'will contain' might be in future tense. Write in the present tense to describe the state of the product as it is now. |
| troubleshoot/elasticsearch/task-queue-backlog.md | 130 | Elastic.Wordiness | Consider using 'sometimes' instead of 'In some cases'. |

The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@stefnestor added the enhancement (New feature or request) and supportability (ability enable self-service or support of product) labels on Jan 15, 2026
@kilfoyle self-assigned this on Jan 15, 2026
Original:
* [Check the thread pool status](#diagnose-task-queue-thread-pool)
* [Inspect hot threads on each node](#diagnose-task-queue-hot-thread)

Proposed:
* [Check thread pool status](#diagnose-task-queue-thread-pool)
* [Inspect node hot threads](#diagnose-task-queue-hot-thread)
Contributor

I actually prefer the original here, but maybe they're not technically accurate.

Contributor Author

That's fair. I was trying to stop the TOC from line-overflowing, and then afterwards was trying to surface that you care whether any threads are caught, regardless of whether they relate to an individual node. But to be fair, I don't feel strongly about it; I probably just got carried away with edits.

@kilfoyle
Contributor

Hi @stefnestor, thanks for another really nice add to our docs!

I've added a bunch of comments, but overall LGTM! 🎸

Co-authored-by: David Kilfoyle <[email protected]>
Contributor

@kilfoyle left a comment

LGTM! 🏎️

@rodrigomadalozzo
Contributor

Hello team,
@stefnestor @kilfoyle

My suggestions are below. Apologies for adding them as a comment instead of editing the page directly. Since there is already a proposed revised version in progress, I didn’t want to introduce conflicting changes. Please consider these suggestions as additions on top of the proposed updates referenced above.

1) Reorder sections for a more logical flow

Since this document focuses on task queue backlog, I suggest reordering the sections so the two that rely on the Task Management API are grouped together. Proposed order:

  1. Identify long-running node tasks
  2. Look for long-running cluster tasks
  3. Check the thread pool status
  4. Inspect hot threads on each node

This keeps task-focused diagnostics together, then moves into thread pool/backpressure and CPU-level investigation.

2) Clarify the interpretation of active and queue under Check the thread pool status

The current text says:

Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.

I noticed that the proposed change already updates this wording, which is an improvement because the original was somewhat vague.
Additionally, I recommend including queue_size in the default command and adding a sort, so “high queue” is meaningful in context (for example, 50/1000 vs 900/1000).
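
To illustrate why the raw queue value is hard to read without queue_size next to it, here is a minimal Python sketch. It assumes rows shaped like the JSON output of the cat thread pool API with the node_name, name, active, queue, and queue_size columns requested. The sample rows and the helper name queue_utilization are hypothetical, and treating a missing or non-positive queue_size as an unbounded queue is an assumption of this sketch.

```python
def queue_utilization(rows):
    """Return (node, pool, queue, queue_size, ratio) tuples, fullest queue first."""
    results = []
    for row in rows:
        queue = int(row["queue"])
        # Assumption: a missing or non-positive queue_size means the queue is unbounded.
        queue_size = int(row.get("queue_size") or -1)
        if queue_size <= 0:
            continue  # a fill ratio is not meaningful for an unbounded queue
        results.append(
            (row["node_name"], row["name"], queue, queue_size, queue / queue_size)
        )
    # Sort by fill ratio so the pools closest to rejecting work come first.
    return sorted(results, key=lambda r: r[4], reverse=True)


# Hypothetical sample rows; cat APIs return values as strings in JSON format.
sample_rows = [
    {"node_name": "node-1", "name": "search", "active": "2", "queue": "50", "queue_size": "1000"},
    {"node_name": "node-2", "name": "write", "active": "8", "queue": "900", "queue_size": "1000"},
    {"node_name": "node-1", "name": "generic", "active": "1", "queue": "0", "queue_size": "-1"},
]

for node, pool, queue, size, ratio in queue_utilization(sample_rows):
    print(f"{node} {pool}: {queue}/{size} ({ratio:.0%} full)")
```

With the sample rows, the write pool at 900/1000 sorts ahead of the search pool at 50/1000, which is the distinction a bare queue count would hide.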

3) Replace the “For example…” sentence under Inspect hot threads on each node with a more actionable snippet

I suggest removing:

For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.

…and replacing it with the following, which is clearer and provides a concrete example:

For example, if hot threads suggest the node is spending time in search, filter the Task Management API to list long-running search tasks on that node. If they suggest indexing/ingest or write activity, filter for write tasks.

GET /_nodes/hot_threads

# Search saturation (example)
GET /_tasks?detailed=true&actions=indices:*search*&nodes=<hot_node_id_or_name>

# Write / ingest saturation (example)
GET /_tasks?detailed=true&actions=indices:*write*&nodes=<hot_node_id_or_name>

Review tasks with the longest running_time_in_nanos (and cancellable=true where applicable) to decide whether to tune workload or cancel non-critical tasks.
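
As a rough sketch of that review step, the Python below flattens a task management API response and ranks tasks by running_time_in_nanos. It assumes the response is grouped by node, with each node holding a tasks map. The sample response, its IDs, and the helper name longest_running_tasks are hypothetical.

```python
def longest_running_tasks(tasks_response, limit=5):
    """Flatten a node-grouped tasks response and return the slowest tasks first."""
    flattened = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            flattened.append(
                {
                    "task_id": task_id,
                    "node": node.get("name", node_id),
                    "action": task["action"],
                    "running_seconds": task["running_time_in_nanos"] / 1e9,
                    "cancellable": task.get("cancellable", False),
                }
            )
    flattened.sort(key=lambda t: t["running_seconds"], reverse=True)
    return flattened[:limit]


# Hypothetical sample shaped like a detailed tasks response.
sample_response = {
    "nodes": {
        "nodeA": {
            "name": "node-1",
            "tasks": {
                "nodeA:101": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 42_000_000_000,
                    "cancellable": True,
                },
                "nodeA:102": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 3_000_000_000,
                    "cancellable": False,
                },
            },
        }
    }
}

for task in longest_running_tasks(sample_response):
    flag = "cancellable" if task["cancellable"] else "not cancellable"
    print(f'{task["task_id"]} {task["action"]} {task["running_seconds"]:.1f}s ({flag})')
```

Sorting in the client keeps the request itself unchanged; whether to cancel a task at the top of the list remains a judgment call about how critical that workload is.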

4) Broaden the phrasing under “Recommendations” to avoid implying only two remediation options

Current sentence:

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

This reads as if there are only two levers (“add resources” or “cancel”). That was true before the proposed modifications, but the section now lists more. Suggested replacement:

After identifying problematic threads and tasks, address the underlying cause using the recommendations below.

@kilfoyle
Contributor

@rodrigomadalozzo your suggestions look really good to me. If you don't mind adding them into the PR I'll be happy to re-review / approve. Thanks!
