### What happened?
This issue summarizes some conversation towards the end of #7890, as requested by @thockin.
If an application in a privileged pod or container creates a FUSE mount in an emptyDir volume but fails to unmount it before terminating (either as a conscious choice by the application, or because kubernetes sent it a SIGKILL), the kubelet will fail to clean up the pod. A recurring error will appear in the kubelet logs, and the pod will remain in the API.
Here is an example error log from the kubelet during cleanup:

```
Jan 08 19:06:04 <hostname omitted> kubelet[12511]: E0108 19:06:04.507950 12511 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/empty-dir/30b506e8-b18a-4d5c-bf7d-17fbae54a5d0-worker podName:30b506e8-b18a-4d5c-bf7d-17fbae54a5d0 nodeName:}" failed. No retries permitted until 2025-01-08 19:08:06.507933266 +0000 UTC m=+1437.970062341 (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "worker" (UniqueName: "kubernetes.io/empty-dir/30b506e8-b18a-4d5c-bf7d-17fbae54a5d0-worker") pod "30b506e8-b18a-4d5c-bf7d-17fbae54a5d0" (UID: "30b506e8-b18a-4d5c-bf7d-17fbae54a5d0") : openfdat /var/lib/kubelet/pods/30b506e8-b18a-4d5c-bf7d-17fbae54a5d0/volumes/kubernetes.io~empty-dir/worker/build: transport endpoint is not connected
```
The offending code seems to be here: https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/volume/emptydir/empty_dir.go#L490-L495
When cleaning up emptyDirs, the kubelet starts with `os.RemoveAll`; as it recurses through the directory, it eventually tries to inspect the contents of the FUSE mount, which results in an error.
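For illustration, the failure mode boils down to something like the sketch below (the path is a placeholder): `os.RemoveAll` tries to open and read the dead FUSE mountpoint, and the kernel returns `ENOTCONN`, which is the "openfdat ... transport endpoint is not connected" error in the log above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Illustrative path: an emptyDir that still contains a dead FUSE mount.
	dir := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/worker"

	if err := os.RemoveAll(dir); err != nil {
		// RemoveAll's internal openat/readdir on the dead mountpoint fails
		// with ENOTCONN; the *PathError it returns wraps the raw errno.
		if errors.Is(err, syscall.ENOTCONN) {
			fmt.Println("dead FUSE mount blocked cleanup:", err)
			return
		}
		fmt.Println("cleanup failed:", err)
	}
}
```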
### What did you expect to happen?
The kubelet should eventually be able to clean up the pod.
### How can we reproduce it (as minimally and precisely as possible)?
- Run a privileged container that creates a FUSE mount within an emptyDir volume (see the sketch after this list).
- Configure the application not to unmount the FUSE mount on exit, or forcefully terminate the pod so the mount cannot be cleaned up.
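A minimal sketch of the first step, assuming the emptyDir is mounted at `/data` in a privileged container and the image vendors `github.com/hanwen/go-fuse/v2` (any FUSE implementation would do; paths are illustrative):

```go
package main

// Minimal FUSE "application" for reproducing the bug: it loopback-mounts
// /tmp/src at /data/mnt (inside the emptyDir) and deliberately never
// unmounts. SIGKILL the pod and the mountpoint is left dead.
import (
	"log"
	"os"

	"github.com/hanwen/go-fuse/v2/fs"
)

func main() {
	for _, d := range []string{"/tmp/src", "/data/mnt"} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			log.Fatal(err)
		}
	}
	root, err := fs.NewLoopbackRoot("/tmp/src")
	if err != nil {
		log.Fatal(err)
	}
	server, err := fs.Mount("/data/mnt", root, &fs.Options{})
	if err != nil {
		log.Fatal(err)
	}
	// Intentionally no server.Unmount(): block until the kubelet kills us,
	// leaving the FUSE mount behind in the emptyDir.
	server.Wait()
}
```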
### Anything else we need to know?
I'd be happy to try to put a patch together to address this with a little guidance. It seems like we should be able to inspect for any mounts beneath the empty directory's `MetaDir` and `umount` them before we attempt to call `os.RemoveAll`.
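For concreteness, here is a rough sketch of that idea; the function name and placement are hypothetical, and it reads `/proc/self/mountinfo` directly, whereas an in-tree patch would presumably go through the existing `k8s.io/mount-utils` helpers instead:

```go
package emptydir // hypothetical placement; sketch only

import (
	"bufio"
	"os"
	"sort"
	"strings"

	"golang.org/x/sys/unix"
)

// unmountAllBelow detaches any mounts still present at or below dir so a
// subsequent os.RemoveAll cannot trip over a dead FUSE mountpoint.
func unmountAllBelow(dir string) error {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return err
	}
	defer f.Close()

	var stale []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// The fifth whitespace-separated field of a mountinfo line is the
		// mount point (octal-escaped, which is fine for kubelet paths
		// since they contain no spaces).
		fields := strings.Fields(scanner.Text())
		if len(fields) < 5 {
			continue
		}
		if mp := fields[4]; mp == dir || strings.HasPrefix(mp, dir+"/") {
			stale = append(stale, mp)
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}

	// Unmount deepest paths first, and use MNT_DETACH so a dead FUSE
	// daemon cannot make the unmount itself hang.
	sort.Slice(stale, func(i, j int) bool { return len(stale[i]) > len(stale[j]) })
	for _, mp := range stale {
		if err := unix.Unmount(mp, unix.MNT_DETACH); err != nil {
			return err
		}
	}
	return nil
}
```

The kubelet would call something like this just before the existing `os.RemoveAll` in the emptyDir teardown path.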