-
Notifications
You must be signed in to change notification settings - Fork 21
Feat: tree listing denormalization #1562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a07b7e7 to
de536f8
Compare
de536f8 to
dc4f725
Compare
| git_repository_branch, | ||
| git_commit_hash, | ||
| git_commit_name, | ||
| git_commit_tags, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be more 1:1 to the frontend? like, only one commit tag instead of an array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm basing off of the treeView response, which returns the tags as an array (btw a list[list[str]] which I think is a mistake), and the frontend is the one to only select the first one I think
bc8ee59 to
2fa5dc4
Compare
bf77a51 to
c1214ef
Compare
backend/kernelCI_app/management/commands/helpers/aggregation_helpers.py
Outdated
Show resolved
Hide resolved
9a1d7a5 to
28cd6a5
Compare
backend/kernelCI_app/management/commands/helpers/aggregation_helpers.py
Outdated
Show resolved
Hide resolved
28cd6a5 to
b4b8876
Compare
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
| builds_by_id: dict[str, Builds], | ||
| test_builds_by_id: dict[str, Builds], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing this rename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be easier to differentiate since these are only builds related to the tests being passed. In the case of treeDetails, we use general builds, not just ones with tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks (and worth writing in the function docs since you're already enhancing it)
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
| test_pass, test_failed, test_inc | ||
| ) | ||
| VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) | ||
| ON CONFLICT (origin, tree_name, git_repository_url, git_repository_branch) DO UPDATE SET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about git_commit_hash, git_commit_name, git_commit_tags? keep the original? update?
it would be nice to comment about those we left out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarceloRobert i think we also need to update the other git_* values, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, we just don't need to update the fields used on on conflict, but I missed those other ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gustavobtflores in such case, shouldn't the query for the hardware status also update the fields other than just the counts? check it in the current _process_hardware_status method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, these fields don't need to be updated even in the tree query because the treeListing table is supposed to only contain the latest checkouts, so this query for updating the counts will only update the counts of existing checkouts, meaning that the values for commit_hash, commit_tags and others won't change. Same for the hardware query
| ready_tests, | ||
| ready_builds, | ||
| t0 = time.time() | ||
| with connections["default"].cursor() as cursor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change from connection to connections['default'], it's inconsistent across this file :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because django says that connection is for backwards compatibility, and to prefer using connections['default']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted to connection in order to be the same as the other one used in the file
| ready_builds.append(pending_build) | ||
| build_checkouts_by_id[pending_build.build_id] = checkout | ||
|
|
||
| if not ready_builds: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this clause correct? if we do have ready_builds (the missing "else" clause) then we return None? it will crash the caller 👀 (since it will unpack the tuple, which is not a tuple but None)
| hardware_status_data, new_processed_entries_hardware = ( | ||
| aggregate_hardware_status(ready_tests, test_builds_by_id) | ||
| ) | ||
| self._process_hardware_status( | ||
| hardware_status_data, | ||
| new_processed_entries_hardware, | ||
| ) | ||
| self._process_new_processed_entries(new_processed_entries_hardware) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these into another function, so whenever it returns hardware_status_data, new_processed_entries_hardware will be garbage collected.
Same for the block below with tree_listing_data, new_processed_entries_tree... otherwise all of these objects will be kept in memory (may cause memory peak)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise all of these objects will be kept in memory (may cause memory peak)
wouldn't they be overwritten in the next loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but just in the next loop iteration, keeping stuff alive when they are not needed may cause the memory peak, if you reduce context and GC runs once the smaller function returns, the objects related to processing hardware and then tree listing will be gone sooner
| ( | ||
| ready_builds, | ||
| build_checkouts_by_id, | ||
| last_processed_build_id, | ||
| skipped_no_checkout, | ||
| ) = self._get_ready_builds( | ||
| last_processed_build_id=last_processed_build_id, | ||
| batch_size=batch_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is only used by _aggregate_tree_listing, whenever you reorganize into smaller functions, move this block + _delete_ready_builds inside it...
that is:
- here in this function you call
_get_ready_tests()since it's shared. - call
self._process_hardware_status(ready_tests, test_builds_by_id), internally it will callaggregate_hardware_status()andself._process_new_processed_entries(new_processed_entries_hardware) - call
self._process_tree_listing(ready_tests, test_builds_by_id), internally it will call_get_ready_builds(),aggregate_tree_listing(),self._process_new_processed_entries(new_processed_entries_tree)and thenself._delete_ready_builds(ready_builds). - this function cleans up
self._delete_ready_tests(ready_tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _get_ready_builds uses last_processed_build_id and skipped_no_checkout, which are variables declared/used within this loop, I think that they could stay inside the loop as well. The similarity between processing the tests for hardware and builds for trees also makes it easier to group the functions and understand them IMO; I made some small changes as you proposed, not much.
| ready_tests.append(pending_test) | ||
| test_builds_by_id[build.id] = build | ||
|
|
||
| if not ready_tests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, is this right?
0cf02d7 to
b57387a
Compare
| test_pass, test_failed, test_inc | ||
| ) | ||
| VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) | ||
| ON CONFLICT (origin, tree_name, git_repository_url, git_repository_branch) DO UPDATE SET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarceloRobert i think we also need to update the other git_* values, no?
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
53b4a75 to
3e4a505
Compare
Part of the changes for the treeListing denormalization
5be6506 to
bfcdbf4
Compare
backend/kernelCI_app/management/commands/process_pending_aggregations.py
Outdated
Show resolved
Hide resolved
| # If we already processed an item, but the previous status is null and the new one is not-null, | ||
| # we need to process it again. That's why we store the status here. | ||
| status = models.CharField( | ||
| max_length=1, choices=SimplifiedStatusChoices.choices, null=True | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we aren't using it yet, right? maybe remove this status from the table and comeback to it when we decide to really solve the issue with the null -> non-null status thing
| if to_process in existing_processed: | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think here we have the culprit on why the countings in the tree listing are incorrect, we could check if the existing_processed has a different status and then count a -1 or +1 to the tree that was affected by the status change
bfcdbf4 to
465a602
Compare
In case some submissions are sent twice, one might have null status and the next one might have the correct status. Since the usual table coalesces the values, so should do the new one.
The hardware listing only considers tests that have a platform defined, but for the treeListing we should also allow tests that have no platform to go through the PendingTest table
465a602 to
366480f
Compare
Changes
Benefits
Testing with a direct call to the old and new tree listing endpoints, the average of 10 calls was:
This represents a 13.56x improvement in performance with a small amount of data. Since the query for the new listing is a single
SELECTstatement, the response time won't grow as fast as the old query when more data is present, meaning that there will be an even better ratio. The tests were made with 50000 real-world submissions locally.How it works
When a new checkout is received, it adds the tree to treeListing with fresh numbers. If that tree already exists and the checkout is newer, the tree will be updated and all counts will reset to 0; if the checkout is older then it does nothing. It will show the count as 0 until the pending tests and builds are processed. Showing 0 counts is useful in order to see if a checkout worked but the builds/tests weren't send for some reason, and such problem should be seen and addressed as soon as possible.
The processing of new builds follows what has been done for the hardwareListing. When a new build is received it updates the
PendingBuildstable, and when theprocess_pending_aggregationscommand is executed it will count the pending build's status on a single treeListing item, then this item will update the treeListing table only for the checkouts that are already existing (so that builds of older checkouts don't update the counts of newer checkouts for a tree).How to test
backend/kernelCI_app/management/commands/tmp_submissions(you can change the path, just update the command below)/backend/volume_dataas a result of thetreeproofcommand. If you are feeling lazy and you don't want to run the command, there's an example file here (Google Drive link)USE_DASHBOARD_DB=Truepoetry run python3 ./manage.py monitor_submissions --spool-dir kernelCI_app/management/commands/tmp_submissions --trees-file volume_data/trees-name.yaml(being in the/backenddirectory, update the paths if necessary)You can also test the ingester on docker, make sure the environment variables are on
.env.backendand usedocker compose run --rm backendbefore running the command shown above.Closes #1558