Presentation: How Containers Have Panned Out

Duration: 4:10pm - 5:00pm

Key Takeaways

  • Hear practical advice from a very early innovator with containers.
  • Understand how Gilt evolved the use of containers into their current architecture.
  • Learn lessons about the struggles/triumphs experienced by Gilt along the way to container adoption.

Abstract

It's been almost three years since Gilt embarked on adopting containers, first using LXC in our physical data-centre in Japan, and then adopting Docker on a mix of physical hardware and virtual machines in Amazon. We've used Docker for continuous, repeatable, immutable deployments of our applications and services; we've used Docker for repeatable build systems; and we've used Docker as a foundational part of the distributed job system 'SunDial' that powers our personalisation and recommendation systems. Now, as we enter Gilt's next stage of evolution as part of Hudson's Bay Company (HBC), we're realising the value of having standardised containers that can be deployed easily across private cloud, public cloud, and traditional data-centre infrastructure.

The dust has settled, and container technology is now moving from early adopters to the mainstream. In this talk, I'll provide detailed examples of how we've used Docker, where container technology has given us the most bang for the buck and, pragmatically, which aspects of the technology haven't panned out as we thought they would.

Interview

Question: 
Your talk is about containers in production. Can you provide the attendees with a bit of background for the talk?
Answer: 
I have been talking a lot about microservices at Gilt during the last year. Over the last 3 or 4 years, we have gone from a monolithic application running in a standard datacenter to a cloud-based deployment on Amazon with about 300 microservices. We shattered our monolith into pieces.
I think the interesting part is the scale of what we are doing. About 3 years ago, we understood we had lots of deployment problems: we would deploy a given service with a whole bunch of other services on the same box, and use a mix of prayer and gut instinct that there was the right level of CPU and resources available.
We got to a point where we were building our own cloud/container infrastructure to get the required level of isolation. We had a really big project lined up to do it; we started that and all of a sudden we realized that there was this thing called Docker, and it gave us everything that we were looking for in terms of immutability and isolation.
Then, in tandem with that, soon after the Docker realisation, we had this idea of immutable deployments. Straight after that came the understanding that we needed to move to the cloud. So we had this convergence of forces: containers, the need for isolation and immutability, and a desire to move everything to the cloud.
Some of our architecture makes use of Docker, and some of it doesn’t. Some of our deployment is still RPM-based, so we are not using Docker everywhere. That’s by the nature of lagging adoption, not because of a technology choice. If you have 300 microservices, and you want to turn all of them to Docker, then you have your hands full. And you have got to ask yourself, is that the most valuable thing that you could be doing with your time? And we decided not to. So some of these services are fine in RPM, but for all new work we are using Docker. Docker has become for us the target deployment platform.
Question: 
What does your deployment pipeline look like at Gilt today?
Answer: 
Our problem is that we have many deployment pipelines. Because we were such early adopters, there wasn’t just one specific solution for deploying a Docker container to a machine in the cloud: at that time, ECS from Amazon didn’t exist. Docker was only getting off the ground. It was early days, and we had a decentralized approach to tooling. As a result, 7 different teams built 7 different tools. We have come to the realization that writing deployment tooling ourselves does not add any value to Gilt. It’s great fun, everyone loves building their own framework, but it is not adding any value: it doesn’t help us sell any more dresses. And now, Amazon is producing tooling that lets us do this really easily.
That’s where we are now. We are using CodeDeploy, we are using CodePipeline, and we are using CloudFormation as part of our deployment.
We have moved away from the continuous deployment dream, to a more developer initiated deployment to production, but there is a lovely sophistication there. We first deploy to a dark canary node, we being the only people who can send traffic to it. Then we upgrade. It becomes a canary release, so one of the nodes is running the new version. Then we release fully to all nodes.
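The three rollout stages described in this answer can be modelled in a few lines. This is only an illustrative sketch, not Gilt's tooling: the `Node` class, the `advance` function, the `public` flag and the convention that the first node is the canary are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    DARK_CANARY = "dark canary"  # new version on one node, internal traffic only
    CANARY = "canary"            # that node now also serves customer traffic
    FULL = "full"                # every node runs the new version


@dataclass
class Node:
    name: str
    version: str
    public: bool = True  # assumption: True means customer traffic is routed here


def advance(nodes, new_version, stage):
    """Apply one rollout stage in place; nodes[0] is the hypothetical canary."""
    canary = nodes[0]
    if stage is Stage.DARK_CANARY:
        canary.version = new_version
        canary.public = False
    elif stage is Stage.CANARY:
        if canary.version != new_version:
            raise ValueError("canary must run the new version first")
        canary.public = True
    elif stage is Stage.FULL:
        for node in nodes:
            node.version = new_version
            node.public = True
    return nodes


if __name__ == "__main__":
    fleet = [Node("a", "1.0"), Node("b", "1.0"), Node("c", "1.0")]
    for stage in Stage:
        advance(fleet, "1.1", stage)
        print(stage.value, [(n.name, n.version, n.public) for n in fleet])
```

In a developer-initiated flow like the one described, each call to `advance` would correspond to an explicit decision by an engineer rather than an automatic promotion.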
Question: 
Your story is a bit different than the usual container story I hear. Because of how early you were, you had to build your own tooling, and now you are adopting or evolving it. Is that accurate?
Answer: 
Yes. The ecosystem did not exist at the start. Then we ended up getting together and forming a single team to figure out deployment, but while that team was on a 9-month or 12-month plan to build the perfect tool for deploying Docker, all the other teams were saying “What’s the quickest way I can get something to production? Let me write a quick script here.” They were all working on their own solutions, and we ended up in a situation where each team had settled on a substandard solution.
By the time we implemented a neat, open source solution, we realized that Amazon was doing it better than we could. It was going to cost us to maintain all that tooling, and that isn’t the game we should be in. That was a real realization for us. We don’t regret the last 3 years, but we’ve landed on what we think is a very pragmatic solution.
Question: 
How does the story of containers at Gilt come through in your talk?
Answer: 
I want to share our story, as a proof of existence, showing that this stuff works. Sharing what we learned along the way, the mistakes we made. This is valuable for people who are early majority, who may want to become early adopters. They can learn from our story.
Some of the lessons are counterintuitive. One of them is that we are deploying one container per Amazon instance, even though part of the Docker dream is that you can have multiple containers on a machine. Our workflow requires that each service has full control over the CPU. The nature of our traffic at noon every day on Gilt is such that, if we don’t have full processor isolation, one rogue service can take down everything. We’ve seen it first-hand.
As a result, we deploy each service into a Docker container on its own virtual machine in Amazon, and that is the way we go. That wasn’t obvious from the start.
Question: 
With this long history of containers in production, do you have suggestions or lessons on things like debugging with containers?
Answer: 
We route all of our logs to CloudWatch. Most of our engineers don’t debug running production instances. Typically, when we are developing, we run locally, and we tunnel the rest to production. That way we can debug against a local instance. This has not been a problem for us. We have never said “We can’t debug our thing because we have deployed it under Docker!”
Question: 
Have you been able to trace the same path through a container that served a request when random things happen?
Answer: 
We use New Relic and that has been helpful in instrumenting all of our services. That is our primary tool for seeing what is the issue if anything happens. And using Docker hasn’t created problems in terms of not being able to use New Relic.
One of the things that is interesting is that at some stage everyone loves to just log into the machine. When you run a Docker image and you connect to it with the bash shell, that works fine. But in general, we find we don’t need to get into the Docker instance at all.
Question: 
What is your view on what you have seen happening in the space over the last few years?
Answer: 
We wanted immutability. We wanted to be able to deploy things that couldn’t change. What we discovered then was the right balance: the Docker container should be immutable, but the AMI, the actual instance, should be mutable. That was a profound result.
We began by making the Amazon instances immutable: this was a bad idea, as shutting down and provisioning new Amazon instances on every deploy is a slow process. It makes more sense to leave the instances running and make the deployment container, the Docker piece, the immutable bit.
We learned we do not need some of the Docker tools. Docker Compose is of no use for us. We wouldn’t dream of using it. And the reason is the web of dependencies between our services, which is so complex that Docker Compose would be just useless. It wouldn’t make any sense.
We also don’t need Docker Swarm. We are just using Docker. We used to have Docker registries as part of our deploy path. But using a third party Docker registry led to an outage on our site. That was a critical failure. When we looked at it, we saw that the dream of creating Docker images and putting them up on a Docker registry is a waste of time. Git is our change management tool; that’s what we use for versioning. We are now using CodeDeploy, and we are storing the images in S3 buckets, then deploying from S3. We really don’t need a Docker registry.
That is a real lesson. It’s unfortunate as well. If I am in the audience, and I am working for Docker, and I am hearing that all the tools that we are building don’t necessarily have a use for us, that is a tough message. They may have a use, but it’s probably in a niche area.
You can adopt Docker and just Docker. You don’t have to think about the wider set of tooling that every salesperson is trying to sell you, because you probably don’t need it.
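The registry-free path mentioned in this answer (images stored in S3 and deployed from there) could look roughly like the sketch below. The interview does not say how the images are exported or how CodeDeploy is wired in, so this shows one plausible approach only: it assumes the standard `docker save`, `docker load` and `aws s3 cp` commands, and the function names, bucket, key and file paths are all hypothetical. The functions only build the command lines; nothing is executed.

```python
def publish_commands(image, tarball, bucket, key):
    """Commands a build machine could run to export an image and copy it to S3."""
    return [
        ["docker", "save", "-o", tarball, image],               # write the image to a tar archive
        ["aws", "s3", "cp", tarball, f"s3://{bucket}/{key}"],   # upload the archive
    ]


def install_commands(tarball, bucket, key):
    """Commands a target instance could run to fetch and load the same image."""
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{key}", tarball],   # download the archive
        ["docker", "load", "-i", tarball],                      # import it into the local Docker daemon
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    for cmd in publish_commands("example-service:1.1", "/tmp/example.tar",
                                "example-bucket", "images/example-1.1.tar"):
        print(" ".join(cmd))
    for cmd in install_commands("/tmp/example.tar",
                                "example-bucket", "images/example-1.1.tar"):
        print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` on a machine with Docker and the AWS CLI installed.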

Speaker: Adrian Trenaman

SVP Engineering, HBC Digital / Gilt & Committer, Apache Karaf

As SVP Engineering, HBC Digital, Ade leads the engineering and infrastructure teams for Gilt in New York and Dublin. He is an experienced, outspoken software engineer, communicator and leader with over 20 years of experience working with teams throughout Europe, the US and Asia in diverse industries such as financial services, telecoms, retail, and manufacturing. In the past, he has held the positions of CTO of Gilt Japan, Tech Lead at Gilt Groupe Ireland, Distinguished Consultant at FuseSource, Progress Software and IONA Technologies, and Lecturer at the National University of Ireland in Maynooth. He became a committer for the Apache Software Foundation in 2010, and has acted as an expert reviewer to the European Commission. Adrian holds a Ph.D. in Computer Science from the National University of Ireland, Maynooth, a Diploma in Business Development from the Irish Management Institute, and a BA (Mod. Hons) in Computer Science from Trinity College, Dublin.

Tracks

Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June