Distinguish+expand Hot Spotting from Task Backlog #4657

stefnestor · 2026-01-15T00:51:51Z

Summary

Follow-up to #4592. When I originally wrote the hot spotting page (elastic/elasticsearch#95429), we didn't have the task queue backlog page, so some content that belongs under task queue backlog ended up in hot spotting.

Undoes that and adds in content related to

Generative AI disclosure

Did you use a generative AI (GenAI) tool to assist in creating this contribution?

[ ] Yes
[X] No

github-actions · 2026-01-15T00:53:04Z

Vale Linting Results

Summary: 2 warnings, 6 suggestions found

⚠️ Warnings (2)

File	Line	Rule	Message
troubleshoot/elasticsearch/hotspotting.md	30	Elastic.Latinisms	Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'.
troubleshoot/elasticsearch/task-queue-backlog.md	128	Elastic.BritishSpellings	Use American English spelling 'behavior' instead of British English 'behaviour'.

💡 Suggestions (6)

File	Line	Rule	Message
troubleshoot/elasticsearch/hotspotting.md	30	Elastic.WordChoice	Consider using 'can, might' instead of 'may', unless the term is in the UI.
troubleshoot/elasticsearch/hotspotting.md	185	Elastic.FutureTense	'will most' might be in future tense. Write in the present tense to describe the state of the product as it is now.
troubleshoot/elasticsearch/hotspotting.md	185	Elastic.FutureTense	'will surface' might be in future tense. Write in the present tense to describe the state of the product as it is now.
troubleshoot/elasticsearch/task-queue-backlog.md	15	Elastic.Wordiness	Consider using 'many' instead of 'a large number of'.
troubleshoot/elasticsearch/task-queue-backlog.md	97	Elastic.FutureTense	'will contain' might be in future tense. Write in the present tense to describe the state of the product as it is now.
troubleshoot/elasticsearch/task-queue-backlog.md	130	Elastic.Wordiness	Consider using 'sometimes' instead of 'In some cases'.

The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

github-actions · 2026-01-15T00:54:08Z

🔍 Preview links for changed docs

troubleshoot/elasticsearch/hotspotting.md

troubleshoot/elasticsearch/task-queue-backlog.md

kilfoyle · 2026-01-15T17:36:46Z

troubleshoot/elasticsearch/task-queue-backlog.md

-* [Check the thread pool status](#diagnose-task-queue-thread-pool)
-* [Inspect hot threads on each node](#diagnose-task-queue-hot-thread)
+* [Check thread pool status](#diagnose-task-queue-thread-pool)
+* [Inspect node hot threads](#diagnose-task-queue-hot-thread)


I actually prefer the original here, but maybe they're not technically accurate.

That's fair. I was trying to stop the TOC entries from overflowing onto a second line, and afterwards I wanted to surface that you care whether any threads are stuck, regardless of whether they relate to an individual node. But TBF I don't feel strongly; I probably just got carried away with edits.

troubleshoot/elasticsearch/task-queue-backlog.md

troubleshoot/elasticsearch/hotspotting.md

kilfoyle · 2026-01-15T18:06:53Z

Hi @stefnestor, thanks for another really nice add to our docs!

I've added a bunch of comments, but overall LGTM! 🎸

Co-authored-by: David Kilfoyle <[email protected]>

kilfoyle

LGTM! 🏎️

rodrigomadalozzo · 2026-01-16T12:01:55Z

Hello team,
@stefnestor @kilfoyle

My suggestions are below. Apologies for adding them as a comment instead of editing the page directly. Since there is already a proposed revised version in progress, I didn’t want to introduce conflicting changes. Please consider these suggestions as additions on top of the proposed updates referenced above.

1) Reorder sections for a more logical flow

Since this document focuses on task queue backlog, I suggest reordering the sections so the two that rely on the Task Management API are grouped together. Proposed order:

1. Identify long-running node tasks
2. Look for long-running cluster tasks
3. Check the thread pool status
4. Inspect hot threads on each node

This keeps task-focused diagnostics together, then moves into thread pool/backpressure and CPU-level investigation.

2) Clarify the interpretation of active and queue under Check the thread pool status

The current text says:

Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.

I noticed the proposed change has since updated this sentence, and the new wording is an improvement, since the original was somewhat vague.
Additionally, I recommend including queue_size in the default command and adding a sort, so “high queue” is meaningful in context (for example, 50/1000 vs 900/1000).
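
To illustrate why queue_size matters, here is a rough Python sketch (not part of the proposed docs text) that ranks thread pools by how full their queue is. It assumes rows shaped like the JSON output of the cat thread pool API with the node_name, name, active, queue, and queue_size columns requested, with numeric values returned as strings and unbounded queues reported as -1. The sample rows and node names are made up.

```python
def queue_utilization(rows):
    """Return rows with a queue fill ratio added, fullest queues first."""
    ranked = []
    for row in rows:
        queue = int(row["queue"])
        queue_size = int(row["queue_size"])
        # Assumption: unbounded queues are reported as -1, so no ratio applies.
        ratio = queue / queue_size if queue_size > 0 else None
        ranked.append({**row, "queue_ratio": ratio})
    # Bounded pools first, sorted by how full the queue is; unbounded pools last.
    return sorted(
        ranked,
        key=lambda r: (r["queue_ratio"] is None, -(r["queue_ratio"] or 0.0)),
    )


# Hypothetical sample data, not taken from a real cluster.
sample_rows = [
    {"node_name": "node-a", "name": "write", "active": "8", "queue": "50", "queue_size": "1000"},
    {"node_name": "node-b", "name": "write", "active": "8", "queue": "900", "queue_size": "1000"},
    {"node_name": "node-a", "name": "generic", "active": "1", "queue": "0", "queue_size": "-1"},
]

for row in queue_utilization(sample_rows):
    print(row["node_name"], row["name"], row["queue_ratio"])
```

With these rows, the 900/1000 queue on node-b sorts ahead of the 50/1000 queue on node-a, which is the distinction a raw queue count alone would not show.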

3) Replace the “For example…” sentence under Inspect hot threads on each node with a more actionable snippet

I suggest removing:

For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.

…and replacing it with the following, which is clearer and provides a concrete example:

For example, if hot threads suggest the node is spending time in search, filter the Task Management API to list long-running search tasks on that node. If it suggests indexing/ingest or write activity, filter for write tasks.

GET /_nodes/hot_threads

# Search saturation (example)
GET /_tasks?detailed=true&actions=indices:*search*&nodes=<hot_node_id_or_name>

# Write / ingest saturation (example)
GET /_tasks?detailed=true&actions=indices:*write*&nodes=<hot_node_id_or_name>

Review tasks with the longest running_time_in_nanos (and cancellable=true where applicable) to decide whether to tune workload or cancel non-critical tasks.
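
As a rough illustration of that review step (again, not proposed docs text), the Python sketch below ranks tasks by running time. It assumes a response grouped by node, with each task carrying action, running_time_in_nanos, and cancellable fields; the node ID, task IDs, and timings in the sample are invented.

```python
def longest_running_tasks(tasks_response, limit=5):
    """Flatten a node-grouped tasks response and return the longest-running tasks."""
    tasks = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            tasks.append({
                "task_id": task_id,
                "action": task["action"],
                # Convert nanoseconds to seconds for readability.
                "running_seconds": task["running_time_in_nanos"] / 1e9,
                "cancellable": task.get("cancellable", False),
            })
    tasks.sort(key=lambda t: t["running_seconds"], reverse=True)
    return tasks[:limit]


# Hypothetical sample response, not taken from a real cluster.
sample_response = {
    "nodes": {
        "node-id-1": {
            "name": "hot-node",
            "tasks": {
                "node-id-1:101": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 120_000_000_000,
                    "cancellable": True,
                },
                "node-id-1:102": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 3_000_000_000,
                    "cancellable": False,
                },
            },
        }
    }
}

for task in longest_running_tasks(sample_response):
    print(task["task_id"], task["action"], task["running_seconds"], task["cancellable"])
```

Sorting by running time first and checking cancellable second keeps the decision about whether to cancel separate from the question of which tasks are actually slow.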

4) Broaden the phrasing under “Recommendations” to avoid implying only two remediation options

Current sentence:

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

This reads as if there are only two levers (“add resources” or “cancel”). That was true before these modifications, but the section now covers more options. Suggested replacement:

After identifying problematic threads and tasks, address the underlying cause using the recommendations below.

kilfoyle · 2026-01-16T14:25:29Z

@rodrigomadalozzo your suggestions look really good to me. If you don't mind adding them into the PR I'll be happy to re-review / approve. Thanks!

stefnestor added 2 commits January 14, 2026 17:47

Distinguish+expand Hot Spotting from Task Backlog

e514965

typo

f5f9785

stefnestor requested a review from a team as a code owner January 15, 2026 00:51

github-actions bot deployed to docs-preview January 15, 2026 00:52

stefnestor added the enhancement and supportability labels Jan 15, 2026

kilfoyle self-assigned this Jan 15, 2026