
Per-plugin callback functions for accurate requeueing in kube-scheduler #4247

Open
4 of 5 tasks
sanposhiho opened this issue Sep 27, 2023 · 31 comments
Assignees
Labels
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@sanposhiho
Member

sanposhiho commented Sep 27, 2023

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 27, 2023
@sanposhiho
Member Author

sanposhiho commented Sep 27, 2023

/sig scheduling

It was suggested that we have a small KEP for QueueingHint. It's kind of a special case, though: we can assume DRA is the parent KEP and this KEP stems from it. I set the alpha version to v1.26, the same as the DRA KEP (or maybe we can just leave it as n/a), and the beta version to v1.28, in which we actually implemented it and enabled it via the beta feature flag (enabled by default).

You are essentially splitting one KEP into two, so there is no grade-skipping as such; the grades were part of the original KEP. We do know when the feature was alpha and when it went from alpha to beta, etc. So please go ahead with a smaller KEP for SchedulerQueueingHints and use the dates from before.
https://kubernetes.slack.com/archives/C5P3FE08M/p1695639140018139?thread_ts=1694167948.846139&cid=C5P3FE08M

@kubernetes/sig-scheduling-leads Can anyone give this the lead-opted-in label?
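For context, the beta flag referred to above is kube-scheduler's SchedulerQueueingHints feature gate, toggled with --feature-gates=SchedulerQueueingHints=true|false. Below is a minimal sketch, assuming the standard in-tree feature-gate helpers, of how scheduler code typically branches on such a gate; the real call sites inside the scheduling queue are more involved than this:

```go
package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

func main() {
	// Standard in-tree pattern for consulting a feature gate. Whether the
	// per-plugin QueueingHint callbacks are used at requeueing time hinges
	// on a check like this (simplified here for illustration).
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
		fmt.Println("SchedulerQueueingHints enabled: plugins' hint callbacks decide requeueing")
	} else {
		fmt.Println("SchedulerQueueingHints disabled: fall back to coarse event-based requeueing")
	}
}
```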

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 27, 2023
@alculquicondor
Member

Do you have a PR for this already?

@sanposhiho
Member Author

Not yet. It will probably be ready by this weekend.

@sanposhiho
Member Author

Here it is: #4256

@Huang-Wei
Member

/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Oct 2, 2023
@rayandas
Member

rayandas commented Oct 4, 2023

Hello @sanposhiho 👋, v1.29 Enhancements team here.

Just checking in as we approach enhancements freeze on 01:00 UTC, Friday, 6th October, 2023.

This enhancement is targeting stage beta for v1.29 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: 1.29.
  • KEP readme has an updated, detailed test plan section filled out
  • KEP readme has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would just need to update the following:

The status of this enhancement is marked as at risk for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@npolshakova npolshakova added this to the v1.29 milestone Oct 4, 2023
@npolshakova

Hello 👋, 1.29 Enhancements Lead here.
Unfortunately, this enhancement did not meet requirements for v1.29 enhancements freeze.
Feel free to file an exception to add this back to the release tracking process. Thanks!

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.29 milestone Oct 6, 2023
@sanposhiho sanposhiho changed the title Per-plugin callback functions for accurate enqueueing in kube-scheduler Per-plugin callback functions for accurate requeueing in kube-scheduler Oct 6, 2023
@npolshakova

Hey again 👋
As #4256 was merged within the additional time approved in the exception request, I am adding this back to the v1.29 milestone and changing the status of this enhancement to tracked for enhancement freeze 🚀

/milestone v1.29

@katcosgrove

Hey there @sanposhiho 👋, v1.29 Docs Lead here.
Does the enhancement work planned for v1.29 require any new docs or modifications to existing docs?
If so, please follow the steps here to open a PR against the dev-1.29 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday, 19 October 2023.
Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.
Thank you!

@sanposhiho
Member Author

sanposhiho commented Oct 11, 2023

@katcosgrove

I submitted the PR modifying the doc here:
kubernetes/website#43427
(Given this KEP is in a unique situation, targeting beta in v1.28 rather than 1.29, we can merge the PR now instead of against dev-1.29.)

@alculquicondor Do you think we need to have a dedicated page or modify some existing pages for QueueingHint? (Or is it OK not to have a doc for QueueingHint, since it's internal?)

@sanposhiho
Member Author

I'd like to have a blog post for this enhancement.
It'd be an interesting one, since there haven't been many official docs or blog posts covering the inside of the scheduling queue and the requeueing mechanism.

@alculquicondor
Member

You could also consider a blogpost under https://www.kubernetes.dev/blog

@sanposhiho
Member Author

By the way, is there any difference between https://kubernetes.io/blog/ and https://www.kubernetes.dev/blog?
Probably the former is for users, so the posts are supposed to be understandable by those who don't know much about Kubernetes internals, while the latter is for contributors, so the posts can dive into implementation details. Is this understanding correct?

@alculquicondor
Member

That is correct :)

@rayandas
Member

Hey again @sanposhiho 👋, 1.29 Enhancements team here,

Just checking in as we approach code freeze at 01:00 UTC, Wednesday, 1st November 2023.

Here's where this enhancement currently stands:

  • All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).

  • All PRs are ready to be merged (they have approved and lgtm labels applied) by the code freeze deadline. This includes tests.

For this enhancement, it looks like the following PRs have merged and are updated in the issue description:

The status of this KEP is tracked for code freeze. 🚀

Also, please let me know if there are other PRs in k/k we should be tracking for this KEP.
As always, we are here to help if any questions come up. Thanks!

@kcmartin

Hi @sanposhiho! 👋 from the v1.29 Release Team Communications! We would like to check whether you have any plans to publish a blog for this KEP regarding new features, removals, and deprecations for this release. It seems from the comment above that this may be the case; please confirm.

If so, you need to open a placeholder PR in the website repository.
The deadline is Tuesday, 14th November 2023 (after the Docs "PR ready for review" deadline).

Here's the 1.29 Calendar

@sanposhiho
Member Author

@kcmartin

Hi, yes, here's the placeholder PR. (empty for now)
kubernetes/website#43686

@sanposhiho
Member Author

sanposhiho commented Dec 1, 2023

(just noticed I forgot to assign it to me)

/assign

@alculquicondor
Member

alculquicondor commented Dec 13, 2023

@sanposhiho since the feature was disabled, please update the KEP with notes on what criteria need to be fulfilled to re-enable the feature. I think the criteria should be roughly:

  • Increase integration coverage for scheduling requiring retries.
  • All plugins must implement the hints, so that we can disable the preChecks when the feature is enabled (see the sketch after this comment).
  • Reduce storage of Pod update events, and potentially other memory usage improvements: scheduler: handle in-flight Pods with less memory kubernetes#120622
  • Proof via clusterloader load test that there isn't a significant memory degradation.

As a side note, here is the perf dashboard for memory usage https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=E2E&metricname=LoadResources&PodName=kube-scheduler-gce-scale-cluster-master%2Fkube-scheduler&Resource=memory. The test hasn't run since the feature was disabled, so I'm not sure if we will see a memory drop. If we do, then not seeing an increase when the feature is re-enabled would be a good signal. Otherwise, we might need to improve coverage in the load test to incorporate cases with retries.
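For context on what "implement the hints" means in practice: each plugin registers the cluster events it cares about together with a QueueingHint callback, and the scheduling queue invokes that callback to decide whether a given event should requeue a Pod the plugin previously rejected. Below is a minimal sketch against the framework API roughly as of v1.29 (ClusterEventWithHint, QueueingHintFn, Queue/QueueSkip); the exact signatures have changed across releases, and ExamplePlugin with its isSchedulableAfterNodeChange callback is a simplified stand-in, not real in-tree code:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ExamplePlugin is a simplified stand-in; only the pieces relevant to
// QueueingHints are shown (Name, Filter, etc. are omitted).
type ExamplePlugin struct{}

// EventsToRegister pairs each interesting cluster event with a hint callback.
// Without a QueueingHintFn, every matching event would requeue the Pod.
func (pl *ExamplePlugin) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{
		{
			Event:          framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Update},
			QueueingHintFn: pl.isSchedulableAfterNodeChange,
		},
	}
}

// isSchedulableAfterNodeChange is called for Node add/update events while a Pod
// rejected by this plugin sits in the unschedulable pool. Queue requeues the Pod
// for another scheduling attempt; QueueSkip leaves it where it is.
func (pl *ExamplePlugin) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	node, ok := newObj.(*v1.Node)
	if !ok {
		// Unexpected object type; requeue to stay on the safe side.
		return framework.Queue, nil
	}
	if node.Spec.Unschedulable {
		// This change cannot make the Pod schedulable, so skip the requeue.
		logger.V(5).Info("node is unschedulable, skipping requeue", "node", node.Name, "pod", klog.KObj(pod))
		return framework.QueueSkip, nil
	}
	return framework.Queue, nil
}
```

The idea behind the second bullet above is that once every plugin returns accurate hints like this, the coarse preChecks can be dropped when the feature is enabled without flooding the active queue with pointless retries.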

@alculquicondor
Member

It doesn't look like there's an effect on memory usage, according to the dashboard.

@salehsedghpour
Contributor

/remove-label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label Jan 6, 2024
@alculquicondor
Member

@sanposhiho I believe you want to target this release?

@sanposhiho
Member Author

@alculquicondor Yes, let's aim to make it into this release.
All blockers are tracked in:
kubernetes/kubernetes#122597

@alculquicondor
Member

@Huang-Wei could you add the lead-opted-in label?

@sanposhiho don't forget to send an update to the KEP with the target version.

@salehsedghpour
Contributor

salehsedghpour commented Jan 16, 2024

As I'm closing the previous milestone, shall we add milestone v1.30?

@salehsedghpour
Contributor

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.29 milestone Jan 16, 2024
@alculquicondor
Member

yes please, add it

@sanposhiho
Member Author

@Huang-Wei Can you add the required labels to this one too, please?

@alculquicondor
Member

#4451 (comment)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2024
@alculquicondor
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2024
Projects
Status: Tracked for Code Freeze
Status: Backlog

10 participants