Planning for Service Mesh: A hard look at service mesh as a solution “SMaaS”

Published in Contino Engineering, Nov 11, 2020

This post explores the jumble of acronyms and keywords around the service mesh, and how what you actually need compares with what you merely want from it.

In my own experience of working with Kubernetes cluster environments, one thing is certain: this technical arena is incredibly fast paced. So much so that as some acronyms arrive, others get left behind on an all-too-regular basis. It becomes ever more tedious to make sense of the Kubernetes components, dependencies, risk impacts and overall infrastructure considerations from a high-level design and architecture standpoint, never mind the finer details.

Remember the Kubernetes storage volume plugin system, the drivers that every vendor developed and that you had to use with their storage offerings? Gone. Long live the Container Storage Interface (CSI), the new formal way forward. If the acronyms or references do not make sense to you, don’t worry.
Remember the PodSecurityPolicy (PSP)? Gone. This space evolves so fast that if you blink slowly enough, there may be yet another optimisation or abstraction after this one.
That’s the price of evolution.

I did digress up there, and for a reason. Today, we have an entirely new Kubernetes component to contend with and to make sense of, on an individual organisational basis.

It is the same story of hype, good marketing and a whole lot of assumptions all over again. I hope to guide you through a comfortable, personal rediscovery, first and foremost of your business requirements.

This blog post invites technical and management readers alike to explore the bigger organisational perspective, as well as to appreciate the finer details, in an effort to help you make better decisions on the topic of service mesh.

Spoiler Alert — The SMaaS acronym is not a thing.

If you’re reading this, you may be an Engineer whose Kubernetes cluster offering is already rich in features and reasonably well observed, with quality monitoring and the appropriate uptime SLAs and KPI metrics for good ol’ record keeping. Or perhaps you’re looking to explore what’s ahead on your Kubernetes (K8s) cluster development and optimisation journey.

If you’re part of the technical Leadership team, you’ve got a keen eye for mapping business needs to candidate solutions. Kubernetes service orchestration, and the rich feature offerings therein, is what you’re particularly keen to see the ROI for. And that’s all good to know.

You’re all in the right place.
Enjoy the read. I’d like to think there is something for everyone.

The Foundations

There comes a time when you have that shiny, new and, most importantly, STABLE Kubernetes cluster. Fantastic.

It performs a certain magic: it enables your application development team to run idempotent MicroService releases with your CI/CD release model of choice. Yes, in production.

Engineering has worked through the application load balancer design decisions, the ingress controllers, the security policies, WAF protection, and so on.

At this point your Kubernetes cluster comes, as it should, with all the best practices in place.
A quick, non-exhaustive list: resource quotas, limits, HorizontalPodAutoscaler, PodDisruptionBudget and even NetworkPolicy to keep the Cloud Security function happy. That’s node taints and workload tolerations aside.

The K8s building blocks aside, I would also expect the monitoring and observability stack, together with the alerting functionality, to be well understood and part of BAU practice. All these tools and processes should be (perhaps battle) tested and appreciated by both platform engineering and the on-boarded application development team(s).

In short, you may be considering expanding the deployment universe of MicroServices to your production-grade Kubernetes environment.

And things are (should be) going well.

Then you hear about this, — this Service Mesh…

The Awkward Questions

First, let’s examine why you think you need it. And why you need it now.

So you looked it up. It looks like it’s what everyone adopts to solve all their business and technical scaling issues. The solution sits somewhere after DevOps, but perhaps right alongside SRE, as it seems to be most certainly AGILE.
Your [Boss|Engineers] certainly seem to want it. And by now, you too are quite keen to get some of that as well…

Alright, I guess this is when we sit down, get some mind-relaxing camomile brew and self-reflect on “the whys and the hows”.

Breathe in, Breathe out. Let’s consider the list.

Self-Reflection: The Why

We should carry out a quick, candid questionnaire, and cover a number of perspectives on the matter.

Foremost, does Service Mesh meet a set of business requirements? And are there alternatives for achieving these requirements?
What is your organisation’s cultural predisposition? How Agile are you really?
Do you have SRE capabilities, or a mature, efficient operations and support function, for the existing Kubernetes environments? (If you have to look up the acronym, the answer is ‘No’.)
Do you really need full encryption of communication inside your Kubernetes cluster, between the running (east-west) services? (mTLS)
How often do you experience SLA impacting downtime, if any?
How effective is your monitoring function? How comprehensive is your monitoring? Do you have a backlog of alerting areas of interest that is revisited and reviewed?
How effective is the observability across the Kubernetes cluster and MicroService applications?
Have you implemented the Kubernetes best practices and the CIS recommendations in your existing Kubernetes cluster environment(s)?
NetworkPolicies as well? If ‘No’, but you’re already thinking ‘hmm, Service Mesh’, you most certainly should have.
Can you realistically commit the engineering effort to assess your existing services’ needs and learn the route to production, in situ, to facilitate a successful migration to Service Mesh?

You get the idea.

If you have particular concerns or areas for improvement in the above, I would argue for cleaning up and optimising the current K8s offering and its SLA first. Service Mesh enables a fantastic number of features and, if left unchecked, opens the proverbial can of worms as far as complexity is concerned. It’s not for everyone and not everyone needs it, in my humble opinion.

The Service Mesh is not a solution, but it could be part of it.

In simple words: it’s OK to walk away from the Service Mesh if the timing or the needs are not quite right.

On the other hand, if there is the culture and the teams (in plural, that is) on standby to implement these Service Mesh features, to further enhance an engineering capability supported by a business use case, super. Dose that with a sprinkle of risk appetite, because it is what it is.
We can now push on.

Planning for Service Mesh

Before we dive right into the “Getting started” docs, consider the part that is usually omitted from the “technical prowess” employed to highlight the awesomeness of a particular product: integration. Let’s try to answer a good number of the integration questions.

What does this integration look like in the Kubernetes cluster?
How does the ingress controller design change? (below)
Do you want to make use of traffic shaping within your cluster, to enable you to canary new service releases? (traffic mirroring, canary, circuit-breaking and even fault injection, to name a few)
How do I, as an Engineer, cater for what is usually multi-tenancy?
Do you operate in financial services, and perhaps see the grand benefit of an added (mTLS) end-to-end encryption layer for your MicroServices, in both east-west and north-south scenarios, that is decoupled from your actual application codebase? (I know, it’s quite a handy feature, this.)
Finally, to be covered in a separate post: what could the Service Mesh release model look like? (Leave a comment on this post if you have thoughts.)
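To make the traffic-shaping question a little more concrete, here is a minimal sketch of a weighted canary split, built as a plain Python dictionary and printed as JSON (which kubectl accepts alongside YAML). The service name `reviews`, the namespace `app1`, the subset names `stable` and `canary` and the 90/10 split are all hypothetical, and the subsets would also need a matching DestinationRule, which is not shown here.

```python
import json


def canary_virtual_service(host, namespace, canary_weight):
    """Build an Istio VirtualService that splits HTTP traffic by weight."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-canary", "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    # The bulk of the traffic stays on the current release.
                    {"destination": {"host": host, "subset": "stable"},
                     "weight": 100 - canary_weight},
                    # A small share is sent to the new release.
                    {"destination": {"host": host, "subset": "canary"},
                     "weight": canary_weight},
                ]
            }],
        },
    }


# Hypothetical service and namespace names, for illustration only.
manifest = canary_virtual_service("reviews", "app1", 10)
print(json.dumps(manifest, indent=2))
```

Raising `canary_weight` step by step (10, 25, 50, 100) is one way to roll a release forward, and the weight change itself becomes a reviewable piece of configuration.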

The Service Mesh (open-source Istio, in my recent case) has blurred those proverbial lines of where the Platform or Infrastructure team starts and finishes their bit, and where the Application Development team takes over the application-service-relevant infrastructure part of the stack.

There will be so-called dragons. Expect them.
Luckily enough, given that you’re likely to be operating a production Kubernetes-based workload, this is nothing new. Your application stack ought to be engineered for failure.

Despite this, as the old saying goes, Hope for the best, but do Expect the worst.

The organisational culture may dictate an all-DevOps approach to this, but what does that really mean in multi-tenant cluster practice, particularly for the application development team, who may now own a good chunk of the infrastructure estate?

We need to run this through a compare-and-contrast scenario to appreciate the differences and how they are apportioned within the wider organisational Operating Model (responsibility matrix).

The Control Plane and the Data Plane

There are two parts to the Service Mesh: the Control plane and the Data plane. Ironically, I have found the two terms used interchangeably, which further confuses the engineering and the leadership teams alike. In Istio, the control plane (istiod) configures and manages the mesh, while the data plane is the set of Envoy proxies that carry the actual service traffic. I hope the visual guide below offers a degree of clarity on this matter.

The initial installation of Istio Service Mesh will deploy all the necessary components inside your Kubernetes cluster. At first, they are largely inert.

You can continue operating your workload using the “classic” choice of the ingress controller, being none-the-wiser. Your cluster is still operating top-notch, just as it did before. You should have all the observability and monitoring tuned to keep an eye on the Service Mesh deployment and status as well. Take it slow.
The leadership team now needs to prioritise which apps, in what environments, can be explored to cross over to the Service Mesh ingress side, and appreciate the risks involved.

Istio service mesh comes with its very own ingress deployment and service (plus a load balancer provisioned by the cloud platform, as a non-exhaustive list). You will need to dive into the configuration of the istio-ingressgateway deployment and start configuring this ingress with the VirtualService and the Gateway.
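As a rough illustration of how the Gateway and the VirtualService fit together on the mesh ingress side, the sketch below builds both as Python dictionaries and prints them as JSON. The hostname `app1.example.com`, the namespace `app1-gateway`, the backing service `app1.app1.svc.cluster.local` and port 8080 are hypothetical. The `istio: ingressgateway` selector is the label carried by the stock istio-ingressgateway pods; check the labels in your own installation before relying on it.

```python
import json

HOST = "app1.example.com"    # hypothetical external hostname
NAMESPACE = "app1-gateway"   # hypothetical per-tenant namespace

# The Gateway describes which ports and hosts the ingress gateway pods expose.
gateway = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "Gateway",
    "metadata": {"name": "app1-gateway", "namespace": NAMESPACE},
    "spec": {
        # Binds this configuration to the istio-ingressgateway pods.
        "selector": {"istio": "ingressgateway"},
        "servers": [{
            "port": {"number": 80, "name": "http", "protocol": "HTTP"},
            "hosts": [HOST],
        }],
    },
}

# The VirtualService attaches to that Gateway and routes to the workload.
virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "app1", "namespace": NAMESPACE},
    "spec": {
        "hosts": [HOST],
        "gateways": [f"{NAMESPACE}/app1-gateway"],
        "http": [{
            "route": [{
                "destination": {
                    "host": "app1.app1.svc.cluster.local",
                    "port": {"number": 8080},
                }
            }]
        }],
    },
}

for manifest in (gateway, virtual_service):
    print(json.dumps(manifest, indent=2))
```

Because neither object touches the existing ingress controller, the “classic” route keeps working while this second route is tested.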

See the graphic above to see how it maps through to the existing workloads, and what Istio components/features are available with the associated interaction between such components.

High-level overview comparing the service mesh configuration (left) with a typical ingress-controller implementation (right). You can end up with two ingress routes, to enable a steady, on-demand switch-over, with DNS/endpoint considerations to follow through on.

Explore the demo service mesh profile to get started. This way you get the minimum components along with the Istio IngressGateway and Istio EgressGateway and their associated Kind configuration, which gives you most of the functionality needed to appreciate the finer service mesh feature set.

These are the service mesh components and the demo service mesh profile available with the istioctl CLI command when getting started and installing Istio.
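For a sandbox, selecting the demo profile can be as short as `istioctl install --set profile=demo`. If you would rather keep the choice in version control, a minimal sketch of the equivalent IstioOperator resource follows. It assumes an Istio release that accepts the `install.istio.io/v1alpha1` IstioOperator API, and the resource name `sandbox-install` is hypothetical.

```python
import json


def istio_operator(profile="demo"):
    """Return a minimal IstioOperator resource that selects an install profile."""
    return {
        "apiVersion": "install.istio.io/v1alpha1",
        "kind": "IstioOperator",
        "metadata": {"name": "sandbox-install", "namespace": "istio-system"},
        "spec": {"profile": profile},
    }


operator = istio_operator()
print(json.dumps(operator, indent=2))
```

Saved to a file, this could be handed to `istioctl install -f <file>`. JSON is a subset of YAML, so it should be accepted as-is, but convert it to YAML if your tooling disagrees.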

The Goal — The BHAG

If you’re an Engineer, you may appreciate the hands-on nature of each of the technical challenges in the Big Considerations section below.

For the Leadership, suffice to say this is by no means an exhaustive how-to guide, but it can serve as a readiness assessment to ensure your organisation gets to a seriously enviable future state of infrastructure. Your competitors may froth with envy at ‘we run service mesh’ announcements, but they will be the first to question your technological prowess when the ‘unexpected’ downtime eventually does occur. That’s besides the SLA impact on your business’s customers.

To be clear, the benefits of such a Service Mesh rollout effort will bear fruit only with techno-cultural buy-in at the organisational level.

And the results of such efforts may be realised only with an all-hands-on-deck approach to learning, sharing and caring sessions, as AGILE as you’d like (claim) to be.

You must ensure that the engineering and application development teams collaborate effectively towards achieving a production Service Mesh implementation: a real Big Hairy Audacious Goal (BHAG).

Big Considerations (the details)

There is a list of considerations to review and mull over when enabling Service Mesh for your application workload, against which to assess the current versus the future state of your Kubernetes infrastructure.

Service Mesh Only Where Needed
Some applications will break when the Envoy proxy gets injected into their Pods. The Envoy sidecar container acts as a funnel for all of the Pod’s network traffic. This means that if the application container needs to form a quorum or connect to other Pods (containers) directly, an Envoy sidecar on every Pod can really complicate things, with the application needing to re-route all comms to 127.0.0.1 for the Envoy sidecar to pick up, on each such Pod. Think Redis or ZooKeeper. See the Namespace Isolation section below for a potential workaround.
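Besides keeping such workloads in a namespace without injection, Istio also honours a per-workload opt-out via the `sidecar.istio.io/inject: "false"` annotation on the Pod template. The sketch below shows a small, hypothetical helper (`opt_out_of_sidecar`) that adds the annotation to a copy of a workload manifest; the trimmed-down Redis Deployment is illustrative only.

```python
import copy
import json


def opt_out_of_sidecar(workload):
    """Return a copy of a workload manifest whose pods skip sidecar injection."""
    patched = copy.deepcopy(workload)
    template_meta = patched["spec"]["template"].setdefault("metadata", {})
    annotations = template_meta.setdefault("annotations", {})
    # Annotation values must be strings, hence "false" rather than False.
    annotations["sidecar.istio.io/inject"] = "false"
    return patched


# Hypothetical, trimmed-down workload used only for illustration.
redis = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "redis", "namespace": "app1"},
    "spec": {
        "selector": {"matchLabels": {"app": "redis"}},
        "template": {
            "metadata": {"labels": {"app": "redis"}},
            "spec": {"containers": [{"name": "redis", "image": "redis:6"}]},
        },
    },
}

patched_redis = opt_out_of_sidecar(redis)
print(json.dumps(patched_redis, indent=2))
```

The trade-off is that an opted-out workload sits outside the mesh, so it gets none of the mesh’s mTLS, traffic shaping or telemetry.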
Network Isolation
By default, all Pods can communicate with all other Pods in a Kubernetes cluster. Make use of NetworkPolicy for traffic flow management, to ensure that application traffic flows via the intended service mesh ingress route, and only that route, where appropriate. This applies to both single-tenant and multi-tenant environments; the minimum use case is a NetworkPolicy that uses a Namespace as a traffic source/destination whitelist. Configuring a NetworkPolicy forces a single approved route for the traffic and ensures you earn the service mesh observability brownie points as well.
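A minimal sketch of that namespace whitelist might look like the following. It assumes the ingress gateway runs in `istio-system` and that the cluster sets the `kubernetes.io/metadata.name` label on namespaces automatically (Kubernetes 1.21 and later); on older clusters you would label the namespace yourself. The policy name and the `app1` namespace are hypothetical.

```python
import json


def allow_ingress_only_from(target_namespace, source_namespace_labels):
    """Build a NetworkPolicy that admits traffic only from matching namespaces."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-from-mesh-ingress",
            "namespace": target_namespace,
        },
        "spec": {
            # An empty podSelector selects every pod in the target namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{
                    "namespaceSelector": {"matchLabels": source_namespace_labels}
                }]
            }],
        },
    }


policy = allow_ingress_only_from(
    "app1", {"kubernetes.io/metadata.name": "istio-system"}
)
print(json.dumps(policy, indent=2))
```

Note that this also blocks pod-to-pod traffic within `app1` itself unless a further `from` entry allows it, so treat it as a starting point rather than a finished policy.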
High Availability with PodDisruptionBudget
Set a sensible minimum number of available pods for the service mesh components, particularly for the in-situ upgrade process. Configure a PodDisruptionBudget to set the minimum pod availability count to be maintained at all times, per service mesh component, during upgrades and nodepool rebalancing.
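As a sketch, a PodDisruptionBudget for the ingress gateway could be generated like this. It assumes the `policy/v1` API (Kubernetes 1.21 and later; older clusters use `policy/v1beta1`) and the `app: istio-ingressgateway` label found on the stock gateway pods. Some Istio install profiles already ship their own budgets, so check what exists before adding another.

```python
import json


def pod_disruption_budget(name, namespace, match_labels, min_available):
    """Build a PodDisruptionBudget keeping at least min_available pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


pdb = pod_disruption_budget(
    "istio-ingressgateway",
    "istio-system",
    {"app": "istio-ingressgateway"},
    1,
)
print(json.dumps(pdb, indent=2))
```

A budget of `minAvailable: 1` only helps if the Deployment runs more than one replica. With a single replica it blocks node drains instead of protecting availability.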
Autoscaling with HorizontalPodAutoscaler
The Istio service mesh ingress component comes as a Kind: Deployment. Provided the Kubernetes cluster runs metrics-server, which exposes the resource metrics, a HorizontalPodAutoscaler (HPA) can be configured to scale the Istio IngressGateway Deployment replica count in line with demand. You do need to right-size the resource requests and limits of the service mesh components, such as the Ingress Gateway deployments, against their actual utilisation to ensure the HPA works correctly.
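To show why right-sizing matters, the sketch below pairs an HPA manifest with the core scaling formula from the Kubernetes documentation, desired = ceil(current replicas × current utilisation / target utilisation). Utilisation is measured against the pods’ CPU requests, so a request that is set far too high or too low skews every scaling decision. The replica bounds and the 80% target are illustrative assumptions, `autoscaling/v2` assumes Kubernetes 1.23 or later, and the helper ignores the tolerance window and min/max clamping that the real controller applies.

```python
import json
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation):
    """Simplified HPA formula; the real controller adds tolerance and clamping."""
    return math.ceil(current_replicas * current_utilisation / target_utilisation)


hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "istio-ingressgateway", "namespace": "istio-system"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "istio-ingressgateway",
        },
        "minReplicas": 2,   # illustrative bounds, tune to your own traffic
        "maxReplicas": 5,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                # Percentage of the CPU *request*, averaged across pods.
                "target": {"type": "Utilization", "averageUtilization": 80},
            },
        }],
    },
}

print(json.dumps(hpa, indent=2))
# Three pods averaging 120% of their CPU request, against an 80% target.
print(desired_replicas(3, 120, 80))
```

As with the PodDisruptionBudget, check whether your install profile already creates an HPA for the gateway before applying your own.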
Namespace Isolation
Separate application workloads with namespaces, a logical grouping by type of workload (istio-system, app1, app2). Logically separating workloads into appropriately labelled namespaces simplifies the automatic Envoy sidecar injection for service mesh integration, since automatic injection is enabled with a namespace label (istio-injection=enabled). This also improves the security stance, as a NetworkPolicy can be applied per namespace, enabling effective security control.
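A small sketch of that per-namespace switch: the helper below (a hypothetical name, `mesh_namespace`) emits a Namespace manifest with the `istio-injection` label set either way. The namespace names `app1` and `datastores` are illustrative.

```python
import json


def mesh_namespace(name, inject_sidecar):
    """Build a Namespace manifest with automatic sidecar injection on or off."""
    label_value = "enabled" if inject_sidecar else "disabled"
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"istio-injection": label_value}},
    }


namespaces = [
    mesh_namespace("app1", inject_sidecar=True),          # joins the mesh
    mesh_namespace("datastores", inject_sidecar=False),   # e.g. quorum-based stores
]
for namespace in namespaces:
    print(json.dumps(namespace, indent=2))
```

Grouping the workloads that must stay out of the mesh into their own namespace keeps the opt-out visible in one place instead of scattered across individual Pod annotations.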
The Gateways
Ingress and Egress gateway configuration, such as VirtualService, Gateway and DestinationRule, can live in one or more namespaces. To maintain a degree of NetworkPolicy and logical control, ensure these configurations are created outside the default, istio-system and kube-system namespaces. Gateway configuration should represent a dedicated, logically separate entity, usually per tenant. The symmetric Egress may get similar treatment, with another dedicated namespace, paired with a NetworkPolicy to match.

It is a real joint effort between Technical Leadership, the development team(s) and platform engineering. Service Mesh touches a lot of IaC and impacts everyone. This is most certainly not a cowboy rodeo. Involve everyone, learn together.

My Thoughts

The enviable features offered by the Service Mesh’s east-west capabilities are impressive, but the risk of getting it wrong is high. Lacklustre results would be the least of your concerns in the worst-case scenario, when the team is desperately trying to troubleshoot a service mesh which has itself gone through an iteration or two of no-longer-backward-compatible updates.
I do not mean to scaremonger, but to be hyper-realistic on the topic.

I do encourage getting started with Service Mesh in a sandbox environment for certain, even today. That’s the learning curve. Demonstrate your learnings and share any issues with your team. Be honest.
Perhaps even keep it running in the development environment as a PoC for a good while.

Going back to the operating model, it ought to be clear from the outset that this ever-changing pace of solution making can breed mixed results. The technical nuances aside, my concern in service mesh solution making is people.

An incremental, steady-as-she-goes approach is best. Assign designated app owners to really **know** their app, and pair them with a designated platform engineer. Rotate. Mix. Knowledge share. Rinse and repeat for all relevant services.

People still build Services, and it’s People who still build your organisation’s infrastructure.
In my opinion, without the right organisational culture, the practice of “you build it, you run it” may manifest those very risks you are keen to avoid. It breeds the very silos you’d rather not end up with, if that’s considered to be Business As Usual.

I see parallels between the organisational and technical implementation of the Service Mesh and Istio’s very own two-plane architecture: the Control Plane and the Data Plane.

That is, the “Control Plane” is the organisation’s culture and vision, which must be ascertained until the last member of the technical leadership team accepts the vision, having reviewed the pros and cons of a Service Mesh rollout. All those WHY questions above need to be followed with a satisfactory response.

And the “Data Plane” is the collaborative effort of application development with the platform engineering team: running the service mesh workshops and hack days, and cycling through knowledge-sharing sessions, which will breed successful integration on the ground.

To those on the cusp of embracing Service Mesh designs and architecture and pushing on ahead: I salute you and wish you favourable winds for your sails.

Now then, the details: distributed tracing capabilities, Istio metrics and cluster interaction visualisation, using the Jaeger and Kiali bolt-ons. All great stuff. I encourage you to get on to those once you have ticked off all the fundamentals, and then some, from the Big Considerations section above.

The shiny UI and graphics will be well appreciated and received — your leadership team will be impressed. And this will be our little secret.

Then consider mutual TLS, both east-west and north-south (though explore the Apigee offering in the latter case). Secure cluster-to-cluster communication is the next-level to-do. This does complicate your setup exponentially. You may even wish to dip your toes into operating your very own Certificate Authority, which will help with the management of multi-cluster encrypted communications. HashiCorp Vault with its PKI offering may be a great area of interest for you then. But it may be a bit too much at this stage.
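For the single-cluster, east-west part, the mTLS switch itself is small. The sketch below builds an Istio PeerAuthentication resource; placed in the root namespace (`istio-system` in a stock install) and named `default`, it applies mesh-wide. Starting in `PERMISSIVE` mode and moving to `STRICT` once every client has a sidecar is my assumption of a cautious rollout order, not a prescription.

```python
import json

VALID_MODES = {"UNSET", "DISABLE", "PERMISSIVE", "STRICT"}


def peer_authentication(namespace, mode="PERMISSIVE"):
    """Build a PeerAuthentication policy setting the mTLS mode for a namespace."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mTLS mode: {mode}")
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


# Mesh-wide and permissive first: plaintext and mTLS are both accepted.
mesh_wide = peer_authentication("istio-system")
# Then strict for a single, hypothetical tenant namespace once it is ready.
app1_strict = peer_authentication("app1", mode="STRICT")

for manifest in (mesh_wide, app1_strict):
    print(json.dumps(manifest, indent=2))
```

Multi-cluster trust and a self-operated Certificate Authority sit on top of this and are deliberately left out of the sketch.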

Right, I think this is it for now. I hope I delivered on the “there is something for everyone” promise, and I look forward to your feedback and comments.

Are you Planning for Service Mesh?
What Problems are you solving?

Connect on LinkedIn or find me on the Istio and Kubernetes Slack groups.

Best,
J
