Microsoft Ignite - The Tour - Amsterdam 2019 - Day 2 - Part 3

Part 3 of my second day of the Microsoft Ignite - The Tour conference in Amsterdam on Thursday, March 21, in RAI Amsterdam. See my post on the first day here, and part one and part two of the second day here.

Scaling for growth and resiliency

14:10 - 15:10 | SRE40 | Elicium 2 | Jeramiah Dooley

“Tailwind Traders has realized that it will need to focus on scaling their application and infrastructure to both handle more traffic than originally expected as well as increase resiliency in the case of failures. Their business demands reliability, so we’ll explore how Azure products can help with delivering it. In this module, you will learn about scaling our application and infrastructure for increased loads as well as how to distribute workloads with Azure Front Door and Azure Availability Zones to protect against localized failures.”

  • Vertical scaling = scaling up: making your box bigger
  • Horizontal scaling = scaling out: adding more boxes to the pool (and the opposite of scaling out is scaling in)

“There’s a spectacular deploy script in the Northwind Trader repo” - Couldn’t find it - @jdooley_clt is looking for it after I asked him via Twitter.

  • “Cool down” is an important setting for auto-scaling: how long it should be depends on how fast a new instance spins up (see the sketch after this list)
  • Azure Paired Regions: “Each Azure region is paired with another region within the same geography, together making a regional pair. Across the region pairs Azure serializes platform updates (planned maintenance), so that only one paired region is updated at a time. In the event of an outage affecting multiple regions, at least one region in each pair will be prioritized for recovery.” We deploy many of our key apps across the two paired regions in Europe: North Europe and West Europe. That means that if, for example, a SQL Server update causes problems for one of our sites (which actually happened to us some time ago), we can direct all traffic to the region where the SQL Server update has not been applied yet!
  • “Azure Front Door Service is Microsoft’s highly available and scalable web application acceleration platform and global HTTP(S) load balancer. It provides built-in DDoS protection and application layer security and caching.” Sounds like an alternative to Incapsula!
  • Right now we are struggling with naming the two regions/environments that we’re deploying to: one is unnamed and the other is called fail-over (because that was the original plan - having an environment to switch to in case of an outage). Jeramiah showed me a better naming convention: primary & secondary.
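
To make the scale-out/scale-in and cool-down ideas a bit more concrete, here is a minimal sketch in Python of a cool-down-aware scaling decision. The thresholds, the cool-down value and the function name are my own illustrative choices - this is not Azure’s actual autoscale engine or anything shown in the session.

import time

# Minimal sketch of a cool-down-aware scaling decision.
# Thresholds and names are illustrative; this is not Azure's autoscale engine.

COOL_DOWN_SECONDS = 300   # should be at least as long as a new instance needs to spin up
SCALE_OUT_CPU = 75        # average CPU % above which we add an instance (scale out)
SCALE_IN_CPU = 25         # average CPU % below which we remove an instance (scale in)

last_scale_action = 0.0

def desired_instance_count(avg_cpu: float, current_count: int) -> int:
    """Return the desired number of instances for the current load."""
    global last_scale_action
    if time.time() - last_scale_action < COOL_DOWN_SECONDS:
        return current_count              # still cooling down: don't flap
    if avg_cpu > SCALE_OUT_CPU:
        last_scale_action = time.time()
        return current_count + 1          # horizontal scaling: add a box
    if avg_cpu < SCALE_IN_CPU and current_count > 1:
        last_scale_action = time.time()
        return current_count - 1          # scale in: remove a box
    return current_count

The point of the cool-down check is to give a freshly added instance time to start taking traffic before the next scaling decision is made.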

Responding to and learning from failure

15:30 - 16:30 | SRE50 | Elicium 2 | Emily Freeman

“Tailwind Traders has done a tremendous amount of good work using modern operations principles and practices to create, deploy, monitor, and troubleshoot their applications and infrastructure in the cloud. As an initial effort, this has been superb, but the engineers know that putting processes in place for continuous learning and continuous improvement are the only sure way to provide continuous value to the customers. In this module, we’ll do more than just talk about these processes, we’ll see how they work in action. We pick up the story right in the middle of Tailwind Traders first significant outage. Everything is on fire (metaphorically) and the engineers are struggling to understand the problem and remediate it as fast as possible. We’ll demonstrate not just how the outage is brought under control, but even more importantly, how Tailwind Traders is able to learn from their experience after the fact and improve their systems while doing so. Understanding this process is one of the most important keys to continuous improvement, “leveling up” our operational practices, and getting the most value from our cloud investments.”

This was my second talk of the day by Emily. She started her talk with an image showing a variation on the This Is Fine meme. This is the original:

Some terms:

  • Mean Time To Recover (MTTR): On average (excluding outliers), how long does it take to restore service when a service incident occurs?
  • Cost of downtime: Deployment frequency * Change failure rate * Mean Time To Recover * Hourly cost of outage (see the example after this list)
  • Time to restore service: for elite teams it’s “less than one hour” (from the DORA: State of DevOps Report 2016)
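
As a quick worked example of that cost-of-downtime formula - with made-up numbers, not figures from the talk - in Python:

# Back-of-the-envelope cost-of-downtime calculation; all numbers are made up.
deployments_per_month = 40       # deployment frequency
change_failure_rate = 0.05       # 5% of deployments cause an incident
mttr_hours = 0.75                # Mean Time To Recover, in hours
hourly_cost_of_outage = 10_000   # cost per hour of downtime

cost_of_downtime = (
    deployments_per_month * change_failure_rate * mttr_hours * hourly_cost_of_outage
)
print(f"Estimated monthly cost of downtime: {cost_of_downtime:,.0f}")
# 40 * 0.05 * 0.75 * 10,000 = 15,000 per month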

Some quotes:

  • E-mail is no place for alerts
  • Alert fatigue is a real thing, don’t do anything to cause that!
  • When it comes to incidents chat is the best channel (Teams, Slack)
  • Create separate incident channels for each incident - it provides focus
  • If you use video during an incident, try to record it

Demo: using an Azure Logic App and Application Insights Action Groups to create a Teams channel with the incident number, and post a message to it using string interpolation, which looked somewhat like this:

incident: {incident-number}
severity: critical
person on-call: {on-call}
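
The demo itself used a Logic App, but roughly the same interpolation and posting could be done in a few lines of Python against a Teams incoming webhook. This is my own sketch, not the code from the session; the webhook URL and the example values are placeholders.

import requests  # third-party package: pip install requests

def post_incident_message(webhook_url: str, incident_number: str, on_call: str) -> None:
    """Post an incident summary to a Teams channel via an incoming webhook."""
    # String interpolation of the incident template shown above.
    text = (
        f"incident: {incident_number}\n"
        f"severity: critical\n"
        f"person on-call: {on_call}"
    )
    # Teams incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(webhook_url, json={"text": text}, timeout=10)

# Example with placeholder values:
# post_incident_message("https://outlook.office.com/webhook/…", "INC-1234", "Jane Doe")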

Severity of an incident

Sev1 | critical | Complete outage
Sev2 | critical | Major functionality broken and revenue affected
Sev3 | warning | Minor problem
Sev4 | warning | Redundant component failure
Sev5 | info | False alarm or unactionable alert
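
One way to make a classification like this actionable is to encode it in your alerting configuration, for example by mapping each severity to how the team gets notified. The mapping below is my own illustrative sketch, not something from the session:

# Illustrative mapping from severity to notification behaviour (my own example).
SEVERITY_RESPONSE = {
    "Sev1": "page the on-call engineer immediately",
    "Sev2": "page the on-call engineer immediately",
    "Sev3": "post to the team chat channel",
    "Sev4": "post to the team chat channel",
    "Sev5": "log only; review during working hours",
}

def notify(severity: str) -> str:
    """Look up what should happen for an alert of the given severity."""
    return SEVERITY_RESPONSE.get(severity, "unknown severity: treat as Sev1")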

Incident response roles

  • First Responder (That would be the person on stand-by duty for us)
  • Incident Commander (Lead coordinator and decision maker during an active incident)
  • Subject Matter Expert (Engineers with expertise and knowledge relevant to the recovery of service)
  • Scribe (Responsible for documenting what is not captured in team chat)
  • Communication Coordinator (Responsible for communicating regularly with internal and external stakeholders)

Post incident review

  • How: Here to learn
  • When: ASAP
  • Where: Welcoming, judgement free space

Exercises

Something Emily inspired me to try to organize: just like we have our yearly fire drill - collectively walking down 16 flights of stairs - we should practice our incident response! Just randomly break one of our dev environments and let the teams fix it ASAP without calling in the help of Ops. Award a prize to the team with the lowest MTTR.