Patrick Ohly [Fri, 26 Dec 2025 09:37:10 +0000 (10:37 +0100)]
hack/verify-featuregates.sh: print failure information to stderr
Verify scripts are run such that stderr is captured and included in the JUnit
files. Stdout is not. Therefore the instructions printed on failure were
only visible by searching the entire job log file, but not in the Prow summary.
Patrick Ohly [Thu, 25 Dec 2025 18:44:24 +0000 (19:44 +0100)]
DRA ExtendedResourceCache: avoid risk of flakes
The unit test has flaked at least once in the CI so far. The flake can be
reproduced locally by running many instances of the test in parallel.
There were two potential root causes:
- The watch must be set up completely before creating objects (a fake
client-go limitation).
- The one second timeout might be too small on a loaded system.
By running the tests in a synctest bubble with synctest.Wait calls at
the right places both problems are avoided. As a welcome side effect
the test also completes faster.
Before:
$ go test k8s.io/dynamic-resource-allocation/deviceclass/extendedresourcecache
ok k8s.io/dynamic-resource-allocation/deviceclass/extendedresourcecache 9.300s
$ stress -p 512 ./extendedresourcecache.test
5s: 0 runs so far, 0 failures, 512 active
10s: 8 runs so far, 0 failures, 512 active
/tmp/go-stress-20251225T195231-2875181701
--- FAIL: TestExtendedResourceCache (9.14s)
extendedresourcecache_test.go:234: Expected to find device class 'gpu-class' for 'example.com/gpu', got nil
extendedresourcecache_test.go:241: Expected to find device class 'fpga-class' for 'deviceclass.resource.kubernetes.io/fpga-class', got nil
extendedresourcecache_test.go:247: Expected default mapping for gpu-class
extendedresourcecache.go:197: I1225 19:52:36.332170] Updated extended resource cache for explicit mapping extendedResource="example.com/gpu" deviceClass="gpu-class-3"
extendedresourcecache.go:204: I1225 19:52:36.332216] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/gpu-class-3" deviceClass="gpu-class-3"
extendedresourcecache.go:220: I1225 19:52:36.332245] Updated device class mapping deviceClass="gpu-class-3" extendedResource="example.com/gpu"
extendedresourcecache_test.go:260: Expected to find device class 'gpu-class' for 'example.com/gpu', got &DeviceClass{ObjectMeta:{gpu-class-3 0 2025-12-24 19:52:35.32560833 +0100 CET m=-86396.691266715 <nil> <nil> map[] map[] [] [] [{unknown Update resource.k8s.io/v1 2025-12-25 18:52:36.332067439 +0000 UTC FieldsV1 {"f:spec":{"f:extendedResourceName":{}}} }]},Spec:DeviceClassSpec{Selectors:[]DeviceSelector{},Config:[]DeviceClassConfiguration{},ExtendedResourceName:*example.com/gpu,},}
extendedresourcecache.go:197: I1225 19:52:37.336135] Updated extended resource cache for explicit mapping extendedResource="example.com/gpu" deviceClass="gpu-class-4"
extendedresourcecache.go:204: I1225 19:52:37.336169] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/gpu-class-4" deviceClass="gpu-class-4"
extendedresourcecache.go:220: I1225 19:52:37.336192] Updated device class mapping deviceClass="gpu-class-4" extendedResource="example.com/gpu"
extendedresourcecache.go:197: I1225 19:52:38.340121] Updated extended resource cache for explicit mapping extendedResource="example
After:
ok k8s.io/dynamic-resource-allocation/deviceclass/extendedresourcecache 0.064s
...
2m0s: 7063 runs so far, 0 failures, 512 active
Walter Fender [Sat, 20 Dec 2025 00:43:04 +0000 (00:43 +0000)]
Update KAS apiserver network proxy to v0.34
Update konnectivity network proxy to v0.34.0. This includes bug fixes, such as a memory leak in http-connect mode and a stale count fix, as well as updates to support Kubernetes 1.34.
(https://github.com/kubernetes-sigs/apiserver-network-proxy/commits/v0.34.0)
Davanum Srinivas [Sun, 21 Dec 2025 03:40:27 +0000 (22:40 -0500)]
Remove orphaned build/nsswitch.conf
This file was added in 2018 (PR #69238) to ensure Go's netgo DNS
resolver respects /etc/hosts in busybox-based control plane images.
In 2021 (PR #99015), the build switched to on-disk Dockerfiles and
distroless base images. The nsswitch.conf copying was dropped and
the distroless base (Debian-based) already includes /etc/nsswitch.conf.
Manuel Grandeit [Sat, 20 Dec 2025 12:34:01 +0000 (13:34 +0100)]
Fix data race in PriorityQueue.UnschedulablePods()
The UnschedulablePods() function iterates over the unschedulablePods.podInfoMap
without holding any lock, while other goroutines may concurrently modify the map
via addOrUpdate(), delete(), or clear().
Other functions like PendingPods() and GetPod() correctly acquire p.lock.RLock()
before accessing unschedulablePods.podInfoMap, but UnschedulablePods() was
missing this.
Fix by adding p.lock.RLock()/RUnlock() to UnschedulablePods(), matching the
pattern used by PendingPods().
Patrick Ohly [Tue, 16 Dec 2025 13:32:00 +0000 (14:32 +0100)]
dependencies: ginkgo v2.27.3 + gomega v1.38.3
This fixes some issues found in Kubernetes (data race in ginkgo CLI, gomega
formatting) and helps with diagnosing OOM killing in CI jobs (exit status of
processes).
The modified gomega formatting shows up in some of the output tests for the E2E
framework. They get updated accordingly.
hongkang [Sat, 11 Jan 2025 16:10:02 +0000 (00:10 +0800)]
Fix VolumeAttachment cleanup when AttachRequired changes
When CSI's AttachRequired changes from true to false after a successful
volume attach, MarkVolumeAsAttached fails because it attempts to look up
the plugin by spec, which fails verification.
This patch passes the VolumeName directly to MarkVolumeAsAttached.
This allows the function to skip the plugin lookup and correctly mark
the volume as attached in the Actual State of World, ensuring
VolumeAttachment cleanup can proceed.
Patrick Ohly [Mon, 1 Dec 2025 14:54:18 +0000 (15:54 +0100)]
build: remove deprecated '// +build' tag
This has been replaced by `//go:build ...` for a long time now.
Removal of the old build tag was automated with:
for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
Patrick Ohly [Thu, 18 Dec 2025 11:06:55 +0000 (12:06 +0100)]
DRA device taints controller: add pohly to OWNERS
While the code is nominally owned by SIG Scheduling, in practice I am the one
who knows it best, so I should be a reviewer and should be able to merge simple
changes without additional approvals (will use cautiously!).