Fix MSQ compaction state and native interval locking, add test coverage#18950
Fix MSQ compaction state and native interval locking, add test coverage#18950cecemei wants to merge 32 commits intoapache:masterfrom
Conversation
| Assert.assertEquals(original.isIncludeAllDimensions(), rebuilt.isIncludeAllDimensions()); | ||
| Assert.assertEquals(original.useSchemaDiscovery(), rebuilt.useSchemaDiscovery()); | ||
| Assert.assertEquals(original.isForceSegmentSortByTimeConfigured(), rebuilt.isForceSegmentSortByTimeConfigured()); | ||
| Assert.assertEquals(original.getSpatialDimensions(), rebuilt.getSpatialDimensions()); |
Check notice
Code scanning / CodeQL
Deprecated method or constructor invocation Note test
| Assert.assertEquals(original.isIncludeAllDimensions(), rebuilt.isIncludeAllDimensions()); | ||
| Assert.assertEquals(original.useSchemaDiscovery(), rebuilt.useSchemaDiscovery()); | ||
| Assert.assertEquals(original.isForceSegmentSortByTimeConfigured(), rebuilt.isForceSegmentSortByTimeConfigured()); | ||
| Assert.assertEquals(original.getSpatialDimensions(), rebuilt.getSpatialDimensions()); |
Check notice
Code scanning / CodeQL
Deprecated method or constructor invocation Note test
capistrant
left a comment
There was a problem hiding this comment.
Thank you for taking the time to extend all the existing testing code to work with MSQ, this is awesome!
Left some comments, I think my main questions/concerns are around the locking change as well as MSQ and dropExisting enforcement
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
| */ | ||
| default boolean forceDropExisting() | ||
| { | ||
| return true; |
There was a problem hiding this comment.
Was this a nasty bug that MSQ compaction didn't enforce dropExisting being true in the past? cuz like I'm sure there are people using it in production and not setting their io config explicitly.. so I guess their tasks would now fail until they set the config? I always thought dropExisting was a native only config since MSQ always creates tombstones regardless of this value. But I guess if REPLACE_LEGACY is used to make any other decisions behind the scenes it could be doing something unexpected?
There was a problem hiding this comment.
MSQ actually never read this config, but instead just run with dropExisting, even though it's set to false (which is also the default) in the config. I thought this would be somewhat confusing if ppl really look into what this field means.
So it's not a real change for compaction runner, the result would still be the same but instead we just want the config to be honest. I was also hesistant about this might break anything, so also considering maybe just revert this change in case it breaks something.
There was a problem hiding this comment.
In my test cluster, my msq compact supervisor does start failing to create tasks after deploying your feature branch. I think this would be a sad change to force people into making.
Caused by: org.apache.druid.java.util.common.IAE: Invalid config: runner[class org.apache.druid.msq.indexing.MSQCompactionRunner] must run with dropExisting
There was a problem hiding this comment.
yea i made the dropExisting change in another pr: https://github.com/apache/druid/pull/18968/changes#diff-280dab320bc1341105272f99d3ec8e7f8626c9d117f7affa933bc1a522bd465aL368, but the check here is probably not all that necessary. anyway i removed this check here so the task wont fail.
| { | ||
| final List<DataSegment> segments = segmentProvider.findSegments(taskActionClient); | ||
| return determineLockGranularityAndTryLockWithSegments(taskActionClient, segments, segmentProvider::checkSegments); | ||
| return determineLockGranularityAndTryLock(taskActionClient, List.of(segmentProvider.interval)); |
There was a problem hiding this comment.
Is this what you refer to in Native runner's index task now locks intervals from CompactionTask instead of existing segments' intervals? Isn't this code on the path for both native and MSQ compaction? that is at least how I interpret it. Not that I think that makes this wrong, but I'm trying to make sure I understand the scope of the change.
There was a problem hiding this comment.
an aside, I think this would make determineLockGranularityAndTryLockWithSegments unused now. do we want to leave that around for future use or remove it?
There was a problem hiding this comment.
Yes this is on both path. This would make a different say compaction is for a day, but maybe we only have segments for 1 hour. This actually broke msq runner since the task lock is not that smart to figure out it's just the same thing, but then i felt this might be just before concurrent append & replace times that we try to lock less, we should probably move things to the concurrent append & replace world.
There was a problem hiding this comment.
removed determineLockGranularityAndTryLockWithSegments
There was a problem hiding this comment.
I am not sure if it safe to remove this. IIUC, the original logic was to fetch all the segments that overlap (even if partially) with the semgentProvider.interval, then try to lock the umbrella interval of the segments. So the final interval locked may be bigger than the original interval. The idea is that a task will be able to replace only those segments which are fully contained in the interval that is locked by that task.
@cecemei , could you share some more details on the issue that you encountered with the MSQ runner and the hour granularity? Maybe there could be an alternative solution.
There was a problem hiding this comment.
ah i was thinking the case when the umbrella interval is smaller than the original interval (which is the case in CompactionTaskRunBase, the segment is hourly and there's only segments in hour 0 - hour 3, so compaction task only locks for 3 hours, but the interval can be all day (see NativeCompactionTaskRunTest). MSQ runner failed when getting a task lock with the following change:
Cannot create a new taskLockPosse for request[TimeChunkLockRequest{lockType=REPLACE, groupId='compact_test_mgmlcmch_2026-02-03T18:37:51.661Z', dataSource='test', interval=2014-01-01T00:00:00.000Z/2014-01-02T00:00:00.000Z, preferredVersion='null', priority=25, revoked=false}] because existing locks[[TaskLockPosse{taskLock=TimeChunkLock{type=REPLACE, groupId='compact_test_mgmlcmch_2026-02-03T18:37:51.661Z', dataSource='test', interval=2014-01-01T00:00:00.000Z/2014-01-01T03:00:00.000Z, version='2026-02-03T18:37:51.673Z', priority=25, revoked=false}, taskIds=[compact_test_mgmlcmch_2026-02-03T18:37:51.661Z]}]] have same or higher priorities
There was a problem hiding this comment.
i'm not sure i quite understand the part replace only those segments which are fully contained in the locked interval? do you mean when append segments get upgrade? for compaction, i think testPartialIntervalCompactWithFinerSegmentGranularity proves that a smaller interval lock works?
...ge-query/src/main/java/org/apache/druid/msq/indexing/destination/SegmentGenerationUtils.java
Show resolved
Hide resolved
...ge-query/src/main/java/org/apache/druid/msq/indexing/destination/SegmentGenerationUtils.java
Show resolved
Hide resolved
…n/task/CompactionTask.java Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com>
…n/task/CompactionTask.java Co-authored-by: Lucas Capistrant <capistrant@users.noreply.github.com>
There was a problem hiding this comment.
Thanks for responding to my questions @cecemei
I think this should be ready then. Could you add a release notes section to the PR description that spec's out the two bug fixes you did for the lastCompactionState in MSQ compaction. Those may be of interest to have in druid 37 release notes. I'll leave it up to you if you also think adding a dedicated note about the locking changes.
kfaraz
left a comment
There was a problem hiding this comment.
Left some questions/suggestions based on first glance.
processing/src/main/java/org/apache/druid/timeline/CompactionState.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/timeline/CompactionState.java
Outdated
Show resolved
Hide resolved
| public static final Granularity FIFTEEN_MINUTE = GranularityType.FIFTEEN_MINUTE.getDefaultGranularity(); | ||
| public static final Granularity THIRTY_MINUTE = GranularityType.THIRTY_MINUTE.getDefaultGranularity(); | ||
| public static final Granularity HOUR = GranularityType.HOUR.getDefaultGranularity(); | ||
| public static final Granularity THREE_HOUR = GranularityType.THREE_HOUR.getDefaultGranularity(); |
There was a problem hiding this comment.
Why is this new granularity needed?
There was a problem hiding this comment.
added this for the test because compaction task automatically picks the finest gran (hourly) if it's not specified, but msq runner can only process one interval per task (there's 3 hours).
There was a problem hiding this comment.
Let's use one of the existing standard granularities then, like SIX_HOUR or EIGHT_HOUR and update the tests accordingly. Would that work?
There was a problem hiding this comment.
updated to use SIX_HOUR and reverted THREE_HOUR change
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionRunner.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
| { | ||
| final List<DataSegment> segments = segmentProvider.findSegments(taskActionClient); | ||
| return determineLockGranularityAndTryLockWithSegments(taskActionClient, segments, segmentProvider::checkSegments); | ||
| return determineLockGranularityAndTryLock(taskActionClient, List.of(segmentProvider.interval)); |
There was a problem hiding this comment.
I am not sure if it safe to remove this. IIUC, the original logic was to fetch all the segments that overlap (even if partially) with the semgentProvider.interval, then try to lock the umbrella interval of the segments. So the final interval locked may be bigger than the original interval. The idea is that a task will be able to replace only those segments which are fully contained in the interval that is locked by that task.
@cecemei , could you share some more details on the issue that you encountered with the MSQ runner and the hour granularity? Maybe there could be an alternative solution.
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Show resolved
Hide resolved
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
...-tests/src/test/java/org/apache/druid/testing/embedded/compact/CompactionSupervisorTest.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
| * | ||
| * @return true if aligned intervals are required by this runner, false otherwise. | ||
| */ | ||
| boolean requireAlignedInterval(); |
There was a problem hiding this comment.
Nit: I wonder if we ever allow non-aligned intervals even for native compaction.
I don't see that field being set while creating auto-compaction tasks.
I feel adding this new method to the CompactionRunner interface for a field which is never used is overkill.
Let's just continue ignoring this field in the MSQ compaction runner as we do today and just log a warning message in MSQ compaction runner if it is ever set to true.
I think we will just go ahead and deprecate this field and remove it altogether in a future release.
There was a problem hiding this comment.
The docs seem to think so too:
https://druid.apache.org/docs/latest/data-management/manual-compaction/#compaction-io-configuration
There was a problem hiding this comment.
I think we have had the parameter for backwards compat for a while, we can probably deprecate it now since auto-compaction does not use it and MSQ does not use it either.
| return this; | ||
| } | ||
|
|
||
| public TaskActionTestKit setBatchSegmentAllocation(boolean batchSegmentAllocation) |
There was a problem hiding this comment.
Just curious, have we added new tests which need this?
There was a problem hiding this comment.
ah i think i had some issues with batch segment allocation with segment lock, and some pending segments cache was not removed properly or something (maybe i should have written it down), and have decided to disable the batch allocation to avoid this issue
There was a problem hiding this comment.
yea so it's set to false in CompactionTaskRunBase
There was a problem hiding this comment.
Oh, if that is the case, we would need to dig deeper into it since it sounds like a bug and batch segment allocation is the default. It should work for all cases.
There was a problem hiding this comment.
it's only for segment lock, and i wasn't sure it's due to test setup or anything.
There was a problem hiding this comment.
Okay, in that case, let's skip the test case for segment lock and add a comment or add a test case and mark it disabled. That way, we would be aware that there is a bug and can address it later.
There was a problem hiding this comment.
i could not reproduce the test cases any more, maybe it disappeared after we switched to use the widen interval. anyways, so i added the segmentQueue test param.
...service/src/test/java/org/apache/druid/indexing/common/task/NativeCompactionTaskRunTest.java
Outdated
Show resolved
Hide resolved
| return inputSpec; | ||
| } | ||
|
|
||
| @Deprecated |
There was a problem hiding this comment.
Nit: Adding a short javadoc indicating why this is deprecated would be nice.
| } | ||
|
|
||
| @Provides | ||
| IndexerControllerContext.Builder providesContextBuilder(Injector injector) |
There was a problem hiding this comment.
We need not use the Injector directly here.
| IndexerControllerContext.Builder providesContextBuilder(Injector injector) | |
| IndexerControllerContext.Builder getControllerContextBuilder( | |
| @EscalatedGlobal ServiceClientFactory serviceClientFactory, | |
| OverlordClient overlordClient | |
| ) |
There was a problem hiding this comment.
Also, is there any functional change here or is the code just being moved from MSQControllerTask?
There was a problem hiding this comment.
there's no functional change, it's just easier for testing so that we can inject the context in tests
There was a problem hiding this comment.
controller context need the injector... i originally added separate class as well but i think some tests breaks since the guice module asks for class instance even before it asks to provide context instance. it's not ideal but injector is in many places.
| */ | ||
| public class IndexerControllerContext implements ControllerContext | ||
| { | ||
| public interface Builder |
There was a problem hiding this comment.
It might be cleaner to move this into a separate file and call it IndexerControllerContextFactory to align with other similar factory classes in Druid.
| this.overlordClient = overlordClient; | ||
| this.memoryIntrospector = injector.getInstance(MemoryIntrospector.class); | ||
| final StorageConnectorProvider storageConnectorProvider = injector.getInstance(Key.get(StorageConnectorProvider.class, MultiStageQuery.class)); | ||
| final StorageConnectorProvider storageConnectorProvider = injector.getInstance(Key.get( |
There was a problem hiding this comment.
Suggestion: Change doesn't seem necessary. It's best to avoid formatting changes unless they really help with readability.
There was a problem hiding this comment.
this has been reverted
| return this; | ||
| } | ||
|
|
||
| public TaskActionTestKit setBatchSegmentAllocation(boolean batchSegmentAllocation) |
There was a problem hiding this comment.
Okay, in that case, let's skip the test case for segment lock and add a comment or add a test case and mark it disabled. That way, we would be aware that there is a bug and can address it later.
| * <p> | ||
| * Useful for verifying that tasks publish the expected segments and schemas in integration tests. | ||
| */ | ||
| public class TestSpyTaskActionClient implements TaskActionClient |
There was a problem hiding this comment.
This action client is very specific to segment transactions and might not have a wide usage.
I would suggest the following:
- Move this test class to
CompactionTaskRunBaseitself - Rename it to something simpler
TestTaskActionClientorWrappingTaskActionClient. (The "spy" is a little misleading since it suggests some usage of Mockito spy utilities). - Avoid javadocs since this code is unit-test-only and seems self explanatory. Add only 1-line javadocs or regular comments where necessary.
| for (LockGranularity lockGranularity : new LockGranularity[]{LockGranularity.TIME_CHUNK}) { | ||
| for (boolean useCentralizedDatasourceSchema : new boolean[]{false}) { | ||
| for (boolean useSegmentMetadataCache : new boolean[]{false, true}) { | ||
| for (boolean useConcurrentLocks : new boolean[]{false, true}) { | ||
| for (Interval inputInterval : new Interval[]{TEST_INTERVAL}) { | ||
| for (Granularity segmentGran : new Granularity[]{Granularities.SIX_HOUR}) { |
There was a problem hiding this comment.
I think only 2 for loops (for useSegmentMetadataCache and useConcurrentLocks) should be enough here. The other parameters seem to be taking a single value only.
| for (boolean useConcurrentLocks : new boolean[]{false, true}) { | ||
| for (Interval inputInterval : new Interval[]{TEST_INTERVAL}) { | ||
| for (Granularity segmentGran : new Granularity[]{Granularities.SIX_HOUR}) { | ||
| String name = StringUtils.format( |
There was a problem hiding this comment.
This should be done in the @Paramaters annotation itself.
| } | ||
|
|
||
| @Test | ||
| public void testWithSegmentGranularity() throws Exception |
There was a problem hiding this comment.
Is this test not needed anymore?
There was a problem hiding this comment.
segment granularity has been tested very throughly in this test after this change, so we dont need it any more.
| import java.util.stream.Collectors; | ||
|
|
||
| @RunWith(Parameterized.class) | ||
| public class CompactionTaskRunTest extends IngestionTestBase | ||
| public abstract class CompactionTaskRunBase |
There was a problem hiding this comment.
The diff for this class seems big and it is a little difficult to ensure that no assertions have actually changed.
Since we are changing some core logic in this PR, I would advise minimizing the changes to this test class so that we can be certain that the new code works correctly with all existing tests.
If any refactor is needed in this test, we can do it in a follow up PR.
For the time being, if some methods need to be reused for the MSQCompactionTaskRunTest, they may be copied over or we may make them public static where applicable.
There was a problem hiding this comment.
motivation to change this test is that it doesnt cover msq runner, and while re-writing it i find some tests too long to read, and some issues in msq runner have been discovered due to making it run with msq runner. the test coverage has improved a lot in this class, priori to this pr, this test only tests native compaction runner and also dont use a real task action client. i was surprised by how little test coverage we have on msq compaction runner, wanted to use this test file for any msq compaction runner related change in the future.
the changes in this pr actually was mostly not covered by this test (before this pr), only the interval locking is related. i re-ran the old test with the interval lock change, the failures are due to the interval diff in compaction state, and another test failed due to lock interval covers the input interval now. so all seems expected.
| return this; | ||
| } | ||
|
|
||
| public TaskActionTestKit setUseCentralizedDatasourceSchema(boolean useCentralizedDatasourceSchema) |
There was a problem hiding this comment.
it's not supported by msq engine, schema is not saved
|
|
||
| public TaskActionTestKit setBatchSegmentAllocation(boolean batchSegmentAllocation) | ||
| { | ||
| if (configFinalized.get()) { |
There was a problem hiding this comment.
the before method was called before @test, wont that mean we can't change the config between tests, but in setup only?
| } | ||
|
|
||
|
|
||
| public Builder toBuilder() |
There was a problem hiding this comment.
ah i thought this toBuilder() is more fluent style? like:
getDefaultCompactionState().toBuilder().dimensionSpec().build()
| GranularityType.HOUR, | ||
| Intervals.of("2016-06-27T01:00:00.000Z/2016-06-27T02:00:00.000Z") | ||
| Intervals.of("2016-06-27T01:00:00.000Z/2016-06-27T02:00:00.000Z"), | ||
| new CompactionTransformSpec( |
There was a problem hiding this comment.
tests without filters are still the same (without transform spec).
| } else { | ||
| // The changed granularity would result in a new virtual column that needs to be aggregated upon. | ||
| dimensionSpecs.add(new DefaultDimensionSpec(TIME_VIRTUAL_COLUMN, TIME_VIRTUAL_COLUMN, ColumnType.LONG)); | ||
| if (!dataSchema.getDimensionsSpec().getDimensionNames().contains(ColumnHolder.TIME_COLUMN_NAME)) { |
| import java.util.stream.Collectors; | ||
|
|
||
| @RunWith(Parameterized.class) | ||
| public class CompactionTaskRunTest extends IngestionTestBase | ||
| public abstract class CompactionTaskRunBase |
There was a problem hiding this comment.
motivation to change this test is that it doesnt cover msq runner, and while re-writing it i find some tests too long to read, and some issues in msq runner have been discovered due to making it run with msq runner. the test coverage has improved a lot in this class, priori to this pr, this test only tests native compaction runner and also dont use a real task action client. i was surprised by how little test coverage we have on msq compaction runner, wanted to use this test file for any msq compaction runner related change in the future.
the changes in this pr actually was mostly not covered by this test (before this pr), only the interval locking is related. i re-ran the old test with the interval lock change, the failures are due to the interval diff in compaction state, and another test failed due to lock interval covers the input interval now. so all seems expected.
| return inputSpec; | ||
| } | ||
|
|
||
| @Deprecated |
| return this; | ||
| } | ||
|
|
||
| public TaskActionTestKit setBatchSegmentAllocation(boolean batchSegmentAllocation) |
There was a problem hiding this comment.
i could not reproduce the test cases any more, maybe it disappeared after we switched to use the widen interval. anyways, so i added the segmentQueue test param.
|
|
||
| public TaskActionTestKit setBatchSegmentAllocation(boolean batchSegmentAllocation) | ||
| { | ||
| if (configFinalized.get()) { |
There was a problem hiding this comment.
well this is just to prevent ppl from accidently calling it inside test method
| } | ||
|
|
||
| @Test | ||
| public void testWithSegmentGranularity() throws Exception |
There was a problem hiding this comment.
segment granularity has been tested very throughly in this test after this change, so we dont need it any more.
Description
This PR fixes several issues with MSQ compaction state and native compaction interval locking, and adds comprehensive test coverage for MSQ compaction tasks.
Fixed compaction state issues
Fixed native compaction interval locking
Added MSQCompactionTaskRunTest
Refactored test infrastructure
Added builder pattern support
Key changed/added classes in this PR
This PR has: