Resilient storage: OpenEBS target on workload node

Release where feature is introduced: 0.8

OpenEBS is known for resiliency and availability. While OpenEBS provides availability through volume replication, there are limits to this strategy. One of the major drawbacks of cloud native storage is that, along with the advantages of the cloud environment, some of its less desirable traits are inherited as well.

Recently, we came across an issue where a user was running OpenEBS as the persistent volume provider for their workload. One of the nodes crashed, and the target pod of a volume happened to be scheduled on that node. As a result, the workload using the volume could no longer read or write.

The target became a single point of failure: losing it disrupted the entire volume. So we came up with a strategy that ties target failure to the workload's node rather than to a random node.

The strategy is simple: schedule the target pod on the same node where the workload is scheduled. That way, if a node failure makes the target unavailable, the workload on that node is down anyway, so the two failures are clubbed together.

Let us see how this would help a workload:

In the first case, if node 2 goes down, the volume becomes unavailable to the workload scheduled on node 1. Case 2 handles this by placing both the target and the workload on the same node. If node 1 goes down in case 2, the workload itself is down, so the volume's unavailability does not affect a running workload.
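The reasoning above can be sketched as a toy model (the node names are made up and this is an illustration of the argument, not OpenEBS code):

```python
# Toy model of the two placements discussed above. A node failure only
# "matters" if the workload is still running but has lost its volume.

def volume_usable_by_workload(workload_node, target_node, failed_node):
    """The workload can use the volume only if it is running and the
    node hosting its target pod is up."""
    workload_up = workload_node != failed_node
    target_up = target_node != failed_node
    return workload_up and target_up

def workload_impacted(workload_node, target_node, failed_node):
    """True when the workload is up but its volume is unreachable."""
    workload_up = workload_node != failed_node
    return workload_up and not volume_usable_by_workload(
        workload_node, target_node, failed_node)

# Case 1: workload on node-1, target on node-2; node-2 fails.
# The workload is still running but has lost its volume.
print(workload_impacted("node-1", "node-2", failed_node="node-2"))  # True

# Case 2: workload and target both on node-1; node-1 fails.
# The workload is down anyway, so the volume outage adds no extra impact.
print(workload_impacted("node-1", "node-1", failed_node="node-1"))  # False
```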

Talk is cheap, show me the code

It is very easy to schedule the target and the workload on the same node. We use the pod affinity and anti-affinity features of Kubernetes to do this. The best part is that you don't need to know any of it; all you need to do is put a matching label on your application and your PVC.

For example, we put the label openebs.io/target-affinity on the PVC and assign it a unique value. The same label must be present on the workload pod. When the target has to be scheduled, Kubernetes looks for a workload pod carrying that label value and schedules the target on the same node where that workload is scheduled.
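Under the hood, this boils down to a pod-affinity rule on the target pod along these lines (an illustrative sketch, not the exact generated spec; the topology key and operator shown are assumptions):

```yaml
# Illustrative sketch of the affinity stanza the target pod ends up with.
# Field values here are assumptions, not the exact generated spec.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: openebs.io/target-affinity
          operator: In
          values:
          - nginx
      topologyKey: kubernetes.io/hostname
```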

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: openebs-pvc
  labels:
    openebs.io/target-affinity: nginx
spec:
  storageClassName: openebs-cstor-default-0.7.0
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 5G

Example workload manifest:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx
  labels:
    name: nginx
    openebs.io/target-affinity: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
        openebs.io/target-affinity: nginx
    spec:
      containers:
      - resources:
          limits:
            cpu: 0.1
        name: nginx
        image: nginx
        volumeMounts:
        - mountPath: /var/lib/openebsvol
          name: demo-vol1
      volumes:
      - name: demo-vol1
        persistentVolumeClaim:
          claimName: openebs-pvc

You can see that both the PVC and the workload carry the same unique label value, nginx.

Demo

Let us create the PVC first and see what happens:

prince@prince:~/go/src/github.com/princerachit/dump/demo-09-11-18$ kubectl apply -f pvc.yaml 
storageclass.storage.k8s.io/openebs-cstor-default-0.7.0 configured
persistentvolumeclaim/openebs-pvc created

If we describe the target pod, we can see that it is stuck in Pending:

prince@prince:~/go/src/github.com/princerachit/dump/demo-09-11-18$ kubectl describe pod/pvc-53d7c2c3-f6b7-11e8-8689-42010a80017d-target-78d687cbbfgsch9 -n openebs
..
..
..
..
Events:
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  10s (x6 over 25s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity rules, 3 node(s) didn't match pod affinity/anti-affinity.

This is because the scheduler is looking for a pod with the label openebs.io/target-affinity: nginx, and no such pod is scheduled anywhere yet.

Let us see what happens after we create the nginx deployment. Here is the describe output of the nginx pod:

prince@prince:~/go/src/github.com/princerachit/dump/demo-09-11-18$ kubectl describe po nginx-df785b5bd-4rjsf
..
..
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type    Reason                  Age   From                     Message
----    ------                  ----  ----                     -------
Normal  Scheduled               112s  default-scheduler        Successfully assigned nginx-df785b5bd-4rjsf to gke-prince-cluster-pool-1-1ddefaed-7nd2
Normal  SuccessfulAttachVolume  112s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-53d7c2c3-f6b7-11e8-8689-42010a80017d"
..
..

The target pod then gets scheduled:

prince@prince:~/go/src/github.com/princerachit/dump/demo-09-11-18$ kubectl describe pod/pvc-53d7c2c3-f6b7-11e8-8689-42010a80017d-target-78d687cbbfgsch9 -n openebs
..
..
..
Events:
Type     Reason                 Age                   From                                              Message
----     ------                 ----                  ----                                              -------
Warning  FailedScheduling       39s (x21 over 5m16s)  default-scheduler                                 0/3 nodes are available: 3 node(s) didn't match pod affinity rules, 3 node(s) didn't match pod affinity/anti-affinity.
Normal   Scheduled              7s                    default-scheduler                                 Successfully assigned pvc-53d7c2c3-f6b7-11e8-8689-42010a80017d-target-78d687cbbfgsch9 to gke-prince-cluster-pool-1-1ddefaed-7nd2
Normal   SuccessfulMountVolume  7s                    kubelet, gke-prince-cluster-pool-1-1ddefaed-7nd2  MountVolume.SetUp succeeded for volume "sockfile"
..
..
..

Both the target pod and the application pod are scheduled on node gke-prince-cluster-pool-1-1ddefaed-7nd2. You can also verify the placement with kubectl get pods -o wide.

References

Check out the pull request.
