Pod Scheduling Readiness #3521
Comments
/sig scheduling |
/label lead-opted-in |
/milestone v1.26 |
Hey @kerthcet 👋, 1.26 Enhancements team here! Just checking in as we approach Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022. This enhancement is targeting stage `alpha` for 1.26. Here's where this enhancement currently stands:
For this KEP, we would need to:
The status of this enhancement is marked as |
Thanks @Atharva-Shinde. |
Hello @Huang-Wei 👋, just a quick check-in again as we approach the 1.26 Enhancements freeze. Please plan to get PR #3522 merged before Enhancements freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow. For reference, the current status of the enhancement is marked |
Thanks for the reminder. It's 99% accomplished atm, just some final comments waiting for the approver to +1. |
With #3522 merged, we have this marked as |
Hi @Huang-Wei 👋, Checking in once more as we approach 1.26 code freeze at 17:00 PDT on Tuesday 8th November 2022. Please ensure the following items are completed:
For this enhancement, it looks like the following PRs are open and need to be merged before code freeze:
Please ensure all of these PRs are linked to this issue as well and mentioned in the initial issue description. As always, we are here to help should questions come up. Thanks! |
@marosset yes, those 3 PRs are code implementation for this KEP in alpha stage. I just updated the issue description to get them linked. |
Hello @Huang-Wei 👋, 1.26 Release Docs Lead here. This enhancement is marked as ‘Needs Docs’ for the 1.26 release. Please follow the steps detailed in the documentation to open a docs PR. If you have any doubts, reach out to us! Thank you! |
Hi @Huang-Wei 👋, Checking in once more as we approach 1.26 code freeze at 17:00 PDT on Tuesday 8th November 2022. Please ensure the following items are completed:
For this enhancement, it looks like the following PRs are open and need to be merged before code freeze:
As always, we are here to help should questions come up. Thanks! |
@marosset ACK, I'm working with reviewers to get 2 pending PRs merged by tomorrow. |
All dev work has been merged. |
For people wanting to request a specific node but still use the scheduling lifecycle / scheduling gates, etc., they should do what the daemonset controller does, and use nodeAffinity to target a single node without setting spec.nodeName:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - <node-name-here>
```
.spec.nodeName was historically designed as a terminal state: the pod's destination to run on. You may find quite a few enforcements built atop it:
So, to adhere to the existing semantics, which we basically cannot change, express the desire of "placing a pod on node X" via nodeAffinity/nodeSelector rather than setting .spec.nodeName, like Jordan said. |
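For illustration only (not from this thread): a minimal sketch of a Pod that combines a scheduling gate with the nodeAffinity approach recommended above, so the pod stays subject to the normal scheduling lifecycle while still being pinned to one node. The gate name `example.com/quota-check` and the node name `node-1` are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-pinned-pod
spec:
  # Hypothetical gate; whichever controller owns it removes it once the pod
  # is actually ready to be scheduled.
  schedulingGates:
  - name: example.com/quota-check
  # Pin to a single node the same way the daemonset controller does,
  # without touching spec.nodeName.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - node-1
  containers:
  - name: app
    image: nginx:1.14.2
```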
RBAC didn't exist until Kube 1.6, spec.nodeName was there from the start :-) |
Nice summary, just one follow-up for my own education:
Why is this better than setting it directly (for example, via a patch)? |
I (and all of us, I believe) fully agree. It's a bad, unrecommended, and confusing field that would never have made it in today; it is there only for historical reasons.
I don't fully agree with this statement. Furthermore, it's not accurate that it means "this pod is on node X". For example, if I try to run a pod with nodeName set that asks for more resources than are available on the node, I'll see this:
What actually happened here?
Would you also say that for standard Resource Quota?
Letting the user say "I want to bypass the external quota mechanism" is absolutely against the concept of quota in the first place. The whole idea is forcing users under a certain namespace to comply with the quota. Effectively this means that external quota mechanisms have to work that out, like we do in the AAQ operator. In my opinion this is a major issue, since one of the main goals of this KEP is to support such external mechanisms. And again, a user saying "I want to bypass quota" is absolutely unacceptable with the standard Resource Quota, so I feel it's a bit unfair to apply this rule only to external mechanisms.

I guess my perspective can be boiled down to: nodeName is either a valid spec field that users may legitimately set to specify their pods' desired state, or it is an internal field that should never be used by users. If it is a valid spec field, we should treat setting it as a completely valid use-case that is supported by both standard and external k8s mechanisms. If it is an internal field that should never be set by users, we should deprecate it and disallow its usage. Saying "it's a valid spec field, but users should never use it, therefore it's fine not to support it, but we won't deprecate it because we don't want to break it" seems contradictory to me. |
Thanks @liggitt. For my education: is there a use-case where someone would need to use nodeName instead of node-affinity? Or is it true that 100% of the time it's better to set node affinity instead of nodeName? |
As soon as a pod with this field is created, it appears in the corresponding node's pod watch (filtered on spec.nodeName). As you point out, the node can still reject it or terminate it, but the pod is being handled by the node at that point, for better or worse.
If they are writing something that is intended to roll pod creation and scheduling into a single step, and are acting as both pod creator and scheduler, setting spec.nodeName on create is logically coherent. Anyone intending to use the normal scheduling flow should not set spec.nodeName.
"Internal" is a blurry line for an API-driven system. Is a custom scheduler "internal"? Is a custom create-and-schedule-in-a-single-step integration "internal"? We will not break the current use of the field. We can improve documentation about it. |
@pohly setting .spec.nodeName directly is hard to guard against (you have to build admission control or leverage OPA), while the separate API endpoint (v1.Binding) is easy to guard with RBAC. |
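To make the RBAC point concrete, here is a minimal sketch (my own, not from the thread) of the kind of rule that gates the binding path: only subjects bound to a role like this can call the pods/binding subresource, while ordinary users simply never receive that permission. The role name is made up.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler-binder   # hypothetical name
rules:
# Allows creating Binding objects via the pods/binding subresource,
# i.e. the v1.Binding endpoint mentioned above.
- apiGroups: [""]
  resources: ["pods/binding"]
  verbs: ["create"]
```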
So the goal was to make setting |
Please try to look at this from a user perspective. Setting
Is there an example of a practical case when this is needed? I still don't see why nodeAffinity is not always the right choice in practice. |
There is another option:

```yaml
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-hostname>
```
To oversimplify: only use
From a user perspective, it is the right choice. Unless you are an administrator trying to run a pod in an emergency, for example, if the scheduler is down. |
this label can be modified |
I'm trying to understand if there's an actual use-case for non-schedulers to use this field. I'm not sure I fully understand why we'd want to keep this field around. Do we want to grant users the possibility of bypassing the scheduler, especially if there is never a valuable use-case that justifies it? The following crazy idea pops into my head:
If there's a must-have concrete reason (I haven't seen one yet) to let users bypass the scheduler, we can introduce a field with a clearer name. Just a crazy idea :) |
An integration that both created and scheduled the pod would set spec.nodeName directly, instead of creating the pod and immediately calling pods/binding to set spec.nodeName. Translating spec.nodeName on create into affinity would break that integration.
There are lots of readers of the field, so we would never remove it. We continue to allow writing it on pod create for compatibility with existing integrations that set it on create.
Requiring a new field to be set to keep existing behavior is just as breaking for compatibility :-/ |
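As a sketch of the "create and schedule in a single step" pattern described above (illustrative only; the node name is a placeholder), such an integration sets spec.nodeName at creation time so no scheduler is involved:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pre-bound-pod
spec:
  # The creating component has already decided the placement, so it writes
  # the node directly instead of leaving the pod for a scheduler to bind.
  nodeName: node-1
  containers:
  - name: app
    image: nginx:1.14.2
```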
Again, oversimplifying, there is no use-case. |
Yes, it can, but you might be breaking a few first-party and third-party controllers that assume that this label matches the nodeName or at least that it is unique. The label is documented as well-known, so it should be treated with care https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetesiohostname |
A good example of how …

```yaml
status:
  phase: Failed
  …
  message: 'Pod was rejected: Node didn''t have enough resource: cpu, requested: 400000000, used: 400038007, capacity: 159500'
  reason: OutOfcpu
  …
  containerStatuses:
  - name: nginx
    state:
      terminated:
        exitCode: 137
        reason: ContainerStatusUnknown
        message: The container could not be located when the pod was terminated
    …
    image: 'nginx:1.14.2'
    started: false
```

To some extent I do not know the current state, but I do wonder - if we are not already doing this today - whether a Pod with scheduling gates and spec.nodeName should be rejected at admission time. |
@fabiand Yes, it will be rejected at admission. |
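For readers who want to see the combination being discussed, a sketch of a manifest that mixes the two fields; per the comment above, API validation is expected to reject it at admission (field values below are placeholders, and the exact error message is not reproduced here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: invalid-gated-pod
spec:
  # Both of these together should fail validation: the pod is pre-bound to a
  # node while also declaring that it is not yet ready to be scheduled.
  nodeName: node-1
  schedulingGates:
  - name: example.com/some-gate
  containers:
  - name: app
    image: nginx:1.14.2
```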
I share that it's a general problem, but due to the special handling of
I do fear that - in your example - kubectl debug or oc debug would have to change and use affinity instead. The core problem is that the kubelet starts to react once nodeName is set. Was it considered to change the kubelet to only start acting once nodeName is set and schedulingGates is empty?
According to the version skew policy, the change would have to be in the kubelet for 3 versions before we can relax the validation in apiserver. I guess that could be backwards compatible if we start in 1.31 and we allow scheduling_gates + nodeName in 1.34.
IMO, that falls under the special case where it might make sense to skip scheduler or an external quota system. You probably wouldn't even want to set requests in a debug pod. |
FWIW - I do wonder if debug pods should actually be guaranteed. I had a couple of cases where debug pods (as best effort) got killed quickly on busy nodes. |
that seems like making the problem and confusion around use of spec.nodeName worse to me... I don't see a compelling reason to do that |
TBH, I'm still trying to understand how skipping the scheduler is ever helpful (when you're not using a custom scheduler).
While this might be correct, the question to me is who makes the decision. Granting a user a knob to skip quota mechanisms feels to me like letting a Linux user bypass permission checks when writing to a file. In both cases the whole idea is to restrict users and force them to comply with a certain policy. Handing the user the possibility to bypass such mechanisms seems entirely contradictory to me, and de facto it makes external quota mechanisms impractical.
Are you open to discussion on that? This way we can avoid breaking backward compatibility, support external quota mechanisms, and extend scheduling gates in a consistent manner, which IMHO makes the exceptional nodeName case less exceptional. |
Not to me... expecting pods which are already assigned to a node to run through scheduling gate removal phases (which couldn't do anything to affect the selected node) seems more confusing than the current behavior which simply forbids that combination. I don't think we should relax validation there and increase confusion. |
I (sadly) concur - If the |
Fun fact: I created a deployment with pod template that had IOW: I wonder if this |
The reason Although it might be achievable with tolerations as well. |
Enhancement Description

- A `schedulingGates` field to the Pod spec that marks a Pod's scheduling readiness.

Alpha
- `scheduler_pending_pods` metric kubernetes#113946

Beta

Stable
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
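For readers landing on this issue, a minimal sketch of the field being tracked here (the gate name is illustrative; per the KEP, the pod stays Pending until every gate has been removed, and gates can only be removed, not added, after creation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gated-pod
spec:
  # An external controller is expected to clear this list (e.g. by patching
  # spec.schedulingGates to []) once the pod is ready to be considered for
  # scheduling; until then the scheduler leaves the pod Pending.
  schedulingGates:
  - name: example.com/wait-for-quota
  containers:
  - name: app
    image: nginx:1.14.2
```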