When businesses grow and expand their data infrastructure to keep up, change is inevitable. Moving your data warehousing to the cloud is a powerful and effective way of scaling up. Of course, to optimize the benefits of upscaling, you need to ensure that your company is set up to run on a cloud-native BI and analytics platform. It’s at this juncture that technologies like Kubernetes and Docker can facilitate and expedite the shift, setting your company up for further growth.

In this blog, we take a look at how Sisense for Cloud Data Teams (previously Periscope Data) moved from Heroku to Kubernetes to run better in the cloud, and how its experience offers a useful case study for other companies facing the same decision.

Starting with Heroku

For startups, Heroku is a fantastic platform because it significantly reduces operational complexity, letting you focus on shipping product. Setting up the DevOps pipelines to move applications from development to production is easy, and there’s no need to create and maintain a continuous delivery system in-house. Apps need minimal infrastructure to go live, and it’s easy to scale them as needed.

This was Sisense’s architecture on Heroku:

For a small team trying to find product/market fit, this is ideal: the team stays focused on building and iterating on its product. But as a business grows and the focus shifts to scaling the product, working on Heroku becomes challenging.

Why leave Heroku?

As a business grows, so do its infrastructure requirements, and Heroku isn’t able to keep up with all of them. Some of the new requirements that arise are:

Requirement #1: Granular control over server resources

Heroku offers four distinct types of dynos: 1x, 2x, Perf M, and Perf L. Being locked into these four options is very limiting when you’re working at scale. Much finer control over resources is necessary to scale an application’s infrastructure economically.

Requirement #2: Reliable scheduling of background processes

The Heroku scheduler is a very minimalistic application, providing limited control over how frequently jobs run (daily, hourly, or every 10 minutes). It’s also difficult to version jobs: if a job needs to be disabled, its definition is simply lost. Worst of all, it is a “best effort” service from Heroku; sometimes the scheduler skips runs, and that’s expected behavior.
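For contrast, once jobs run on Kubernetes (which is where, as described later in this post, Sisense eventually landed), this level of control is easy to express. The sketch below is a minimal, hypothetical CronJob manifest; the name, image, schedule, and Rake task are illustrative assumptions, not Sisense’s actual configuration:

# Hypothetical example of a scheduled job as a Kubernetes CronJob: the schedule is a
# full cron expression, the definition lives in version control, and missed or
# overlapping runs are controlled explicitly.
apiVersion: batch/v1                       # on older clusters this is batch/v1beta1
kind: CronJob
metadata:
  name: nightly-cleanup                    # illustrative name
spec:
  schedule: "15 2 * * *"                   # any cron expression, not just daily/hourly/10-minute
  concurrencyPolicy: Forbid                # never run two copies of the job at once
  startingDeadlineSeconds: 600             # how late a missed run may still start
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: cleanup
            image: example.com/app:latest  # illustrative image
            command: ["bundle", "exec", "rake", "cleanup:run"]  # hypothetical Rake task

Disabling a job becomes a one-line change (suspend: true) rather than deleting the definition.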

Requirement #3: No databases or backend microservices should be internet-facing

Moving Postgres databases out of Heroku gives you better control over them. When superuser privileges and tighter security are needed, Heroku falls short: its databases can’t be firewalled, and many organizations are unwilling to allow their database to be internet-facing. Since private networking on Heroku is unusually expensive, one workaround is to send database connections from Heroku to RDS (inside a VPC) through SSH tunnels. This increases operational overhead and hurts performance. Adding connection pooling with a tool such as pgbouncer on top of the SSH tunnels means yet another layer of complexity. Each new microservice needs a similar networking setup, and the overhead becomes too much to manage.

Requirement #4: Minimal external dependencies affecting product uptime

Any dependency on an outside tool creates a ceiling for the performance of a system. With Heroku, this means a ceiling on the uptime guarantee that can be provided to your customers. 

Heroku also impacts your release schedule. Let’s say you ship code multiple times a day, so you make it a practice to never ship when any relevant Heroku component has a yellow or red status. You can be burned by compounding Heroku outages with your own mistakes, so it’s risky to ship during a Heroku event, even for seemingly unrelated issues. This often leads to delayed or skipped deploys, which reduces your ability to ship product to customers.

Requirement #5: Infrastructure should be developed with scalability in mind

With every customization and add-on, you build your infrastructure the Heroku way. The sooner you move out of Heroku, the easier it is to migrate applications to a more scalable architecture that can serve you long term: one where you manage the operational infrastructure, monitoring, continuous integration, and deployment pipelines yourself. In the long run, not managing your own core operations is always going to be more expensive.

So, while Heroku works for any young startup, you eventually grow out of it as a platform, and it becomes prudent to bring more infrastructure in-house, with fewer external dependencies and more direct control.

Why choose Kubernetes?

Having decided to bring your infrastructure in-house, the decision to use Kubernetes is a relatively easy one. It’s a container orchestration platform that provides the flexibility you need to build and manage a cloud-native application.

Kubernetes offers an excellent set of tools to manage containerized applications. You can think of it as managing a desired state for the containers. Here are some of its attractive features:

  • Scalability: Applications can easily be scaled up by increasing the resources allocated to them. Scaling out is just as easy: increase the number of replicas for a controller. Horizontal autoscalers can even do this automatically based on observed resource metrics.
  • Versioning: A new version of the application creates a new replica set when deployed; in case of a regression, it can easily be rolled back to the old version.
  • Distribution: A rolling update strategy allows you to release a new version of the application n pods at a time, limiting the number of unavailable containers at any given moment.
  • Load Balancing: Kubernetes gives containers their own IP addresses and a single DNS name for a set of containers, with the possibility of load-balancing across them (see the Deployment and Service sketch after this list).
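To make these concrete, here is a minimal sketch of a Deployment and Service; the names, image, and numbers are illustrative, not Sisense’s actual manifests. The replica count handles scale-out, the rolling-update strategy bounds how many pods are unavailable during a release, and the Service provides a stable DNS name that load-balances across the pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # illustrative name
spec:
  replicas: 10                       # scale out by changing this (or let an autoscaler do it)
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2              # release a few pods at a time, bounding unavailability
      maxSurge: 2
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:v42   # each new version creates a new replica set; roll back by redeploying the old tag
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: web                          # the pods share a single DNS name ("web") inside the cluster
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 3000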

Once you’ve tested Kubernetes with your services internally, you get a taste of the power it offers, and it becomes clear why Kubernetes is quickly becoming the gold standard of the industry.

The Migration

Building a plan

At the time, Sisense’s web app had over 50 background jobs and about half as many microservices. The web app was Rails, and background jobs were mainly written in Ruby with a Rake interface. Microservices were in Go, with the exception of one written in Java. Web assets were compiled using Yarn.

As part of this migration, the decision was taken to move the microservices first. This would lay the foundation for a live Kubernetes setup, followed by an internal setup for development and test. Moving the microservices first was also lower risk; it’s much safer to route to microservices hosted at different domains than to redo the networking in front of the web servers. It was also necessary to select DevOps tooling: a build and release pipeline for Kubernetes, and a way to build and store Docker images.

Next, the background jobs were moved to the new system, so that the Ruby side of the application could then be addressed.

The last step would be to migrate the web app. This would involve compiling web assets, routing and serving HTTP requests, builds and releases, and load testing for performance. The key challenge here would be to manage the rollout and move live traffic from Heroku to Kubernetes.

How was this achieved?

A comparable architecture of the Sisense system on Kubernetes follows, and the sections below cover some of the lessons learned during the migration.

Lesson #1: A reverse proxy app can be a powerful tool to manage HTTP requests

The Heroku router returned error codes such as H12 (Request Timeout), H15 (Idle Connection) and H18 (Server Request Interrupted), which were useful in hiding complexity from the alert configs. For example, our “Too many request timeouts” alert was configured as: count:1m(code='H12') > 50.

If the Rails app took longer than 30 seconds, the Heroku router returned a 503 response before the request was completed, so it was agreed to implement something similar using an HAProxy app. The key challenge here involved setting the right HTTP options based on the desired behavior and tuning the client, request, and response timeouts between the proxy and the load balancer. This resulted in very efficient management of requests.

Early on, there was an HAProxy issue where requests would fail because the proxy app was unable to resolve the server host. It turned out that for DNS resolution to happen correctly, health checks had to be configured for the proxy app.

Lesson #2: Horizontally scalable services don’t always scale across different deployments

While moving the first background job to Kubernetes, a concurrency bug surfaced that caused the same job to run multiple times. In Sisense’s product, users can clone dashboards, and doing so schedules a background job for the cloning process. The job was being picked up by multiple microservice instances, so the dashboard would be cloned multiple times. The problem was rooted in the use of an env variable in the homegrown job runner to disambiguate between running jobs and those that were killed during a release. The job runner thread would lock a job with a specific release version to ensure it didn’t get picked up by other threads, but jobs in Kubernetes had a different version format than on Heroku. The versioning logic that managed locking became too complex, so eventually a switch was made to a much cleaner, version-agnostic solution. This prompted a re-evaluation of all Heroku env variables.

Lesson #3: A carefully chosen ratio of concurrency resources can lead to a highly optimized setup

One of the experiments during this migration was to get the right balance between pod size (CPU and memory), the number of Phusion Passenger application processes, and the number of threads per process. This depended somewhat on the type of requests: long-running requests behaved differently from short requests, and CPU-bound requests had a different usage pattern than IO-bound ones. The type of 5xx errors from the proxy app, along with the number of LivenessProbe/ReadinessProbe failures on the pods, was a good indicator of whether a combination was worth pursuing. On Heroku, Sisense for Cloud Data Teams (previously Periscope Data) was using 8 Perf L dynos with 4 processes and 7 threads each; on Kubernetes it switched to 75 smaller pods with 1 process and 12 threads each. Smaller pods minimized the impact to customers when a pod went down due to an error state. The HAProxy docs were a great resource for understanding the errors, and various articles online helped the team come up with a good starting point for the ratios of resources/processes/threads.
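The probe failures mentioned above are configured per container. A minimal, hypothetical readiness/liveness configuration looks roughly like this (the health-check path, port, and timings are illustrative assumptions; in practice this sits inside a Deployment’s pod template):

apiVersion: v1
kind: Pod
metadata:
  name: web-probe-example
spec:
  containers:
  - name: web
    image: example.com/web:v42       # illustrative image
    ports:
    - containerPort: 3000
    readinessProbe:                  # failing readiness takes the pod out of the Service rotation
      httpGet:
        path: /healthz               # hypothetical health-check endpoint
        port: 3000
      periodSeconds: 10
      timeoutSeconds: 5
    livenessProbe:                   # repeated liveness failures restart the container
      httpGet:
        path: /healthz
        port: 3000
      initialDelaySeconds: 30
      periodSeconds: 10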

Lesson #4: Achieve zero downtime during migration by using a reverse proxy app

Before the web app went live on the new system, it was put through load testing with JMeter. Since no amount of testing could truly simulate live traffic, it was decided to dynamically send traffic to both the Heroku and Kubernetes systems using a percentage-based rollout. This gave the team an ideal platform to manage traffic in real time and make adjustments as needed. A reverse proxy app was created, using HAProxy, to sit in front of the web application. This roughly simulated the Heroku router functionality, albeit with potentially different behavior and some added latency. The proxy made it possible to redirect partial traffic to Kubernetes using cookie-based routing logic (described below). Introducing the proxy app in front of all live traffic was one of the most nerve-wracking moments of this migration.

In the Sisense app, customers have the option to create multiple “sites,” and an authorized user has access to one or more sites. Site-level cookies are used to implement routing. Each request was checked to see if it contained a Kubernetes-enabling cookie and, if it did, the traffic was routed to the Kubernetes servers. By default, traffic initially went to Heroku. This made routing simple and fast. The request-level timeouts on the proxy app required some tuning, but a working system for traffic redirects was established, along with a blacklist to exclude sites that had specific networking needs until those could be addressed. It then became simply a case of changing a flag for a site or adjusting the rollout percentage, and subsequent requests for the affected sites would be routed to Kubernetes.

The cookie-based logic looks something like this:

# Set a long-lived opt-in cookie when the site is flagged for Kubernetes; clear it otherwise
if kube_enabled && !cookies[:periscope_web_kube].present?
   cookies[:periscope_web_kube] = { value: 1, expires: 1.year.from_now }
elsif !kube_enabled
   cookies.delete(:periscope_web_kube)
end

And the proxy logic:

# if cookie is present, then redirect to Kubernetes
acl cookie_found hdr_sub(cookie) periscope_web_kube
use_backend kube if cookie_found

Over a period of two weeks, the web application shifted from running almost entirely on Heroku to almost entirely on Kubernetes.

Breakdown of traffic during the transition from Heroku (purple) to Kubernetes (green)

Lesson #5: Managing releases to two production environments is highly error prone

For a significant amount of time, Sisense deployed to both Heroku and Kubernetes to ensure it would have a fallback. Since the methodology and the amount of time to deploy to the two systems were different, it was difficult to control precisely when a new deploy would be fully available to all requests. The application needed to account for the possibility that different versions of the code could be active at the same time, which meant that every deploy had to be backward compatible. Features were put behind flags so they could be turned on and off in a predictable manner, and database migrations were deployed as a separate, earlier release from the corresponding code. If there was a regression, the rollback happened on both systems. Deployments generally took longer early on and were prone to errors because of the dependencies involved.

Lesson #6: Database connection pooling is a great optimization in conserving database resources

Since Heroku did blue/green deploys, the number of database connections almost doubled for a brief period during each deployment. Because partial traffic was also being sent to Kubernetes, the connection count grew further, resulting in some instability and an increased number of 4xx and 5xx errors from the application. During this time, a running count of the expected database connections at any given point was maintained, which was an error-prone process. Eventually, a separate pgbouncer service for pooled connections was created, and that resolved the connection count issues.
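As a sketch of what such a setup can look like (a sketch under assumptions; the image name, RDS host, and pool sizes below are illustrative, not Sisense’s actual configuration), pgbouncer can run as its own Deployment and Service inside the cluster, so application pods connect to the pooler instead of opening connections directly against RDS:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    app = host=mydb.example.us-east-1.rds.amazonaws.com port=5432 dbname=app
    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    ; many app and worker connections fan in to a much smaller pool of real database connections
    max_client_conn = 2000
    default_pool_size = 50
  userlist.txt: |
    "appuser" "md5-placeholder-password-hash"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
      - name: pgbouncer
        image: example.com/pgbouncer:latest   # illustrative image
        command: ["pgbouncer", "/etc/pgbouncer/pgbouncer.ini"]
        ports:
        - containerPort: 6432
        volumeMounts:
        - name: config
          mountPath: /etc/pgbouncer
      volumes:
      - name: config
        configMap:
          name: pgbouncer-config
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer                    # app pods point their database host at pgbouncer:6432
spec:
  selector:
    app: pgbouncer
  ports:
  - port: 6432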

Lesson #7: Make sure to explicitly specify CPU and memory request and limit values

Sisense monitors its infrastructure using Datadog and Scalyr. For each application it moved, it started with a slightly overprovisioned set of resources, then scaled down where possible. Although it was difficult to find exactly equivalent metrics across platforms (CPU share vs. CPU core vs. vCPU), it was very easy to scale in or out with Kubernetes. The docker.cpu.usage and kubernetes.memory.usage metrics were very useful in determining resource utilization.

One issue was that when the request, the limit, or both were not specified for a replication controller, resource usage on the host became unpredictable, making the system unstable. Specifying resource requests and limits was the simple fix.
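For reference, this is what an explicit resources stanza looks like on a container (the values here are illustrative, not the ones Sisense used); in a replication controller or Deployment it goes inside the pod template:

apiVersion: v1
kind: Pod
metadata:
  name: web-resources-example
spec:
  containers:
  - name: web
    image: example.com/web:v42   # illustrative image
    resources:
      requests:                  # what the scheduler reserves for the pod on a node
        cpu: "500m"
        memory: 1Gi
      limits:                    # hard caps; exceeding them means CPU throttling or an OOM kill
        cpu: "1"
        memory: 2Gi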

Lesson #8: CPU throttling looks like 5xx errors

New Relic was used to track application-level metrics. The key metrics tracked during the migration were the Average Response Time and the Request Queuing Time per request. Since a reverse proxy app had been introduced in front of the requests, it added some latency, so the request queuing times were monitored and the proxy configurations (mainly HTTP options and timeouts) were tuned to minimize queuing time.

The number of 4xx and 5xx error responses from the application was also tracked, initially to ensure the numbers were comparable to what they had been on Heroku. Requests turned out to be more CPU-bound, and when there weren’t enough resources to support the demand, requests failed with 500s, 502s, 503s, and 504s. Simply adding more replicas solved the issue (a horizontal autoscaler would surely have come in handy).

CPU throttling when there are not enough resources to handle requests
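This is the scenario a horizontal autoscaler is built for. A minimal HorizontalPodAutoscaler sketch (the target name and thresholds are illustrative) that adds replicas when average CPU utilization climbs might look like this:

apiVersion: autoscaling/v2             # older clusters use autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # the Deployment to scale
  minReplicas: 20
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70         # add pods when average CPU utilization exceeds 70%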

Lesson #9: Be proactive with training and enablement

An often ignored (but very important) thing to plan for is spreading knowledge of the new systems internally. This needed to happen at the same pace as the migration so that internal teams could support the live systems while other teams continued to develop new features. Wikis and scripts were created to make operational tasks repeatable, and on-call alerts were evaluated and reworked to prevent false positives.

Sisense for Cloud Data Teams on Kubernetes

Even after completing the move to Kubernetes and doing a lot more with it, Sisense still used Heroku for one thing: PR apps. Heroku manages these well, and setting up an environment to do the same internally would be rather tricky.

Bringing the operational infrastructure in-house is a huge undertaking. It may be less efficient early on, but it opens the gates for scaling your company and laying the groundwork for future growth and optimization.
