Deploying Chuspace globally using Docker
Outlining my thought process for deploying a simple Rails app globally
Rails app deployment is a hard problem, but deploying globally, close to the reader, is even harder. In this edition, we use Docker instead of machine images for Rails application deployment.
A bit of a recap:
Chuspace is a simple Rails app powered by a MySQL database, with Memcached running on each node. The MySQL database is managed and provided by PlanetScale, with replicas spread across the globe. The previous version used Terraform with Packer to build and deploy pre-built machine images.
Issues with the previous version:
Complex - building and tracking multi-provider/region machine images on each deploy was slow and complicated. I had to manually copy new images to each region and build separately for each cloud provider. Autoscaling was hard.
Deploy downtime (up to 10 mins)
I avoided Docker to keep things simple and low-overhead; however, that made the deployment process very complex, mainly because of managing VM images on multiple providers.
Docker solves the application portability problem by turning application code into a deployable artifact that can be run on any supported platform.
How does the new flow work?
There is still Terraform in the picture because it’s just great for managing multiple infrastructure resources. GitHub CI uses Terraform to spin up new VMs on each code deploy, with pre-configured user data that installs Docker and runs the application using docker-compose. There is also a systemd unit file that restarts the Docker containers after a machine restart.
The Terraform module looks like this:
Each replacement VM is created first, and only then is the old one destroyed, which reduces downtime.
Application environment variables are managed via AWS Secrets Manager, as you may have noticed above, and are injected on demand before the Docker containers are started.
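The injection step can be sketched roughly as follows. This is a minimal illustration, not the actual Chuspace setup: it assumes the secret is stored as a flat JSON object of string values and that the containers read a plain `KEY=value` env file; the helper name `render_env_file` and the sample keys are made up for the example.

```ruby
require "json"

# Hypothetical helper: turns the JSON string stored in a secret into
# KEY=value lines, the format Docker env files use.
# Assumes a flat JSON object whose values contain no newlines.
def render_env_file(secret_json)
  JSON.parse(secret_json)
      .sort
      .map { |key, value| "#{key}=#{value}" }
      .join("\n") + "\n"
end

# In a real deploy the string would come from AWS Secrets Manager, e.g. via
# `aws secretsmanager get-secret-value --query SecretString --output text`.
# The values below are placeholders.
secret = '{"RAILS_ENV":"production","DATABASE_URL":"mysql2://example"}'
puts render_env_file(secret)
```

Writing the result to a file on the VM just before `docker-compose up` keeps secrets out of the image and out of the Terraform state, at the cost of one extra step in the boot script.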
Logs from the running containers are aggregated using Logtail.
Monitoring is done using Weave Scope, which also has a nice UI for exec’ing into containers when needed, for example, to run a Rails console.
Cloudflare manages the incoming traffic using a network load balancer, which routes traffic based on proximity. If a pool becomes unhealthy during deploys, all of its traffic gets re-routed to the primary region. The primary region is deployed last.
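The routing rule above can be modelled in a few lines. This is only a toy sketch of the behaviour as described here, not Cloudflare's actual steering logic: the `Pool` struct, the region names, and the use of a single distance number as "proximity" are all assumptions made for illustration.

```ruby
# Hypothetical model of a regional pool; fields are illustrative only.
Pool = Struct.new(:region, :healthy, :distance_km, keyword_init: true)

# Returns the nearest pool if it is healthy; otherwise falls back to the
# pool in the primary region, mirroring the rule described in the text.
def route(pools, primary_region:)
  nearest = pools.min_by(&:distance_km)
  return nearest if nearest&.healthy

  pools.find { |pool| pool.region == primary_region }
end

pools = [
  Pool.new(region: "fra", healthy: false, distance_km: 300),
  Pool.new(region: "sin", healthy: true,  distance_km: 10_000),
  Pool.new(region: "iad", healthy: true,  distance_km: 6_500)
]

# The nearest pool ("fra") is mid-deploy and unhealthy, so the request
# goes to the primary region instead.
puts route(pools, primary_region: "iad").region
```

This also shows why the primary region has to be deployed last: it is the fallback for every other pool, so it must stay healthy while the others are being replaced.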
The deployment happens in two steps using GitHub CI and rake tasks - step one deploys to all regions except the primary, and step two deploys to the primary. I think there is room for improvement here but I haven’t come up with anything solid yet. Docker makes the deployment process simple and easy to reason about. However, there is still some serious downtime during deploys (30 seconds). I can mitigate that by keeping the old resources around a bit longer and then swapping the traffic using the application load balancer, as described in the Terraform blue-green deployment tutorial.
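The two-step ordering could be expressed in plain Ruby along these lines, ready to be wrapped in a rake task. It is a sketch under assumptions: the region names are placeholders, and the block passed to `deploy_all` stands in for whatever actually performs a regional deploy (for example, a `terraform apply` for that region).

```ruby
# Splits regions into the two deploy steps: every non-primary region
# first, the primary region last.
def deploy_steps(regions, primary:)
  unless regions.include?(primary)
    raise ArgumentError, "unknown primary region: #{primary}"
  end

  [regions - [primary], [primary]]
end

# Walks the steps in order and yields each region with its step number.
# The block is a hypothetical stand-in for the real per-region deploy.
def deploy_all(regions, primary:)
  deploy_steps(regions, primary: primary).each_with_index do |step, index|
    step.each { |region| yield region, index + 1 }
  end
end

deploy_all(%w[fra iad sin], primary: "iad") do |region, step|
  puts "step #{step}: deploying #{region}"
end
```

Keeping the ordering in one small function makes it easy to change later, for example to deploy non-primary regions one at a time instead of as a single step.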
Should I use Docker Swarm?
Maybe? But I have a few unique env variables per region, so I'm not sure how that would work in swarm mode.
Should I use HashiCorp Nomad?
Maybe? But for production deploys it requires a few extra components like Consul and Vault, and I don’t want to learn all that right now.
I'm also not sure how per-region env variables would work here.