Microsoft Ignite - The Tour - Amsterdam 2019 - Day 2 - Part 1

This was my second day at the Microsoft Ignite - The Tour conference, held on Thursday, March 21 at RAI Amsterdam. See my post on the first day here.

GitHub sticker!

Modernizing your infrastructure: moving to Infrastructure as Code

09:30 - 10:30 - SRE10 - Auditorium RAI Theater - Emily Freeman

“Deploying applications to the cloud can be as simple as clicking the mouse a few times and running “git push”. The applications running at Tailwind Traders, however, are quite a bit more complex and, correspondingly, so are our deployments. The only way that we can reliably deploy complex applications (such as our sales and fulfillment system) is to automate it. In this module, you’ll learn how Tailwind Traders uses automation with Azure Resource Management (ARM) templates to provision infrastructure, reducing the chances of errors and inconsistency caused by manual point and click. Once in place, we move on to deploying our applications using continuous integration and continuous delivery, powered by Azure DevOps.”

Emily Freeman

Site Reliability Engineering is an engineering discipline devoted to helping an organization achieve the appropriate level of reliability in their systems, services, and products.

“A characteristic of great teams is that everyone trusts each other.”

The Small Batches Principle

Emily talked about the importance of small batches: when every release contains only a small amount of new code, the lines causing a problem are easier to find. That got me thinking: doesn’t that introduce a new problem? When you have many deployments in a short period of time, how can you be sure which of them introduced the bug or regression?

Added to my reading list: The Small Batches Principle - Reducing waste, encouraging experimentation, and making everyone happy

Demo: deploying Resource Manager templates from Azure Cloud Shell. This is also covered in the docs.

Monitoring your infrastructure and applications in production

10:50 - 11:50 - SRE20 - Hall 11 - Jason Hand

“Tailwind Traders runs its entire business in the cloud and we need to understand what is happening with our cloud resources 24/7. To do that, we’ve implemented Microsoft Azure’s monitoring solutions. In this module, you’ll learn how we use Azure Monitor to understand and visualize time series data using Application Insights and Log Analytics. We’ll also monitor the health of our cloud services using Azure Service Health. With a more observable and data rich system Tailwind Trader’s engineers are now poised to know about, respond to, and resolve problems in real-time before they impact users and the business’ bottom line.”

Jason Hand

It looks like this might be a good book to read (free online: https://landing.google.com/sre/books/), since both Emily and Jason seemed to be quoting it extensively:

“Site Reliability Engineering - Members of the SRE team explain how their engagement with the entire software life cycle has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world.”

Jason took Mikey Dickerson’s Service Reliability Hierarchy (elements that go into making a service reliable, from most basic to most advanced) from that resource:

Some quotes:

  • “you can’t have root-cause analysis on a complex system” (You should do a post-incident review instead.)
  • “slow is the new down”

What’s your definition of reliability?

This slide illustrates the many ingredients of reliability - food for thought:

Service Level Indicators (SLIs)

I’ve been looking at these every day, but I never knew there was a term for them! Will be using that…



Service Level Objectives (SLOs)

The same goes for SLI’s sibling: I know for sure I’ll be talking about “the SLI dropping below the SLO” in the years to come!


Quote (paraphrased):

  • “Don’t paint yourself into a corner by setting too high an SLO” (based on historic performance - like the disclaimer in ads for financial products in the Netherlands warns: “results from the past are no guarantee for the future”).
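To make “the SLI dropping below the SLO” concrete, here is a minimal Python sketch. It assumes an availability SLI defined as the percentage of successful requests per time window. The `Window` class, the sample numbers, and the 99% target are all made up for illustration and are not taken from Jason’s talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one time window (hypothetical shape)."""
    label: str
    total: int
    succeeded: int


def availability_sli(window: Window) -> float:
    """Return the percentage of successful requests in the window."""
    if window.total == 0:
        # Assumption: no traffic counts as fully available,
        # which also avoids dividing by zero.
        return 100.0
    return 100.0 * window.succeeded / window.total


def windows_below_slo(windows: list[Window], slo_percent: float) -> list[str]:
    """Return the labels of the windows whose SLI is below the SLO."""
    return [w.label for w in windows if availability_sli(w) < slo_percent]


# Made-up sample data, for illustration only.
sample = [
    Window("09:00", total=1000, succeeded=999),
    Window("10:00", total=1000, succeeded=985),
    Window("11:00", total=0, succeeded=0),
]
breaches = windows_below_slo(sample, slo_percent=99.0)
print(breaches)
```

The SLO is passed in as a parameter on purpose: picking that number is the judgment call the quote above warns about, so it should be easy to change.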

Demo

His demo of querying App Insights to return charts with an SLI dropping below the SLO at some points really made me want to go out and do those myself. One of the first things I thought of was using it for analytics for this blog, because I find Google Analytics to be almost completely unusable. It looks like it’s only useful for tracking the sales your Google Ads have generated, and not very useful for seeing things like which posts get the most engagement.

Here’s one of the sample queries (in the Kusto query language):

// Top 10 countries by traffic
// Chart the amount of requests from the top 10 countries
requests
| summarize CountByCountry=count() by client_CountryOrRegion
| top 10 by CountByCountry
| render piechart

Bizarrely, my blog is more popular in Czechia than in Serbia (where some of my team members live):

Actionable Alerting

Bottom line: an alert is only useful when there’s someone who can do something about it, i.e. when it’s actionable. Don’t alert on things that always fail, and don’t alert on things that succeed.

A proper actionable alert should have this information:

  1. Where the alert is coming from
  2. What expectation was violated
  3. Why this is an issue (for our customers)
  4. Steps to resolve the problem
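The four points above map nicely onto a small data structure. This is a minimal Python sketch of that idea, not the payload Azure or Teams actually uses. The `ActionableAlert` class, its field names, and the sample values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionableAlert:
    source: str                # 1. where the alert is coming from
    violated_expectation: str  # 2. what expectation was violated
    customer_impact: str       # 3. why this is an issue (for our customers)
    resolution_steps: list[str] = field(default_factory=list)  # 4. steps to resolve

    def is_actionable(self) -> bool:
        """An alert is only worth sending when all four parts are filled in."""
        texts = (self.source, self.violated_expectation, self.customer_impact)
        return all(t.strip() for t in texts) and len(self.resolution_steps) > 0

    def render(self) -> str:
        """Format the alert as plain text, e.g. for a chat message."""
        steps = "\n".join(
            f"  {i}. {step}" for i, step in enumerate(self.resolution_steps, start=1)
        )
        return (
            f"Source: {self.source}\n"
            f"Violated: {self.violated_expectation}\n"
            f"Impact: {self.customer_impact}\n"
            f"Steps to resolve:\n{steps}"
        )


# Made-up example values, for illustration only.
alert = ActionableAlert(
    source="checkout-api availability check",
    violated_expectation="availability SLI below the 99% SLO",
    customer_impact="customers cannot complete their orders",
    resolution_steps=["Check the latest deployment", "Roll back if it is the cause"],
)
if alert.is_actionable():
    print(alert.render())
```

Refusing to send anything for which `is_actionable()` returns `False` is one way to enforce the bottom line above: if nobody can act on it, it is not an alert.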

Jason demonstrated these principles in a demo where he showed how to have Azure send an automated voice call (“because I don’t wake up to e-mail”) when an SLI drops below its SLO. He also showed how you could at the same time send more context about the problem to a Teams channel (covering the four points mentioned above).

I was recently asked to join the rotating stand-by team for Nextens, and I will only agree if we have this set up! ☺

Monitoring the monitoring

That’s a very important point: not receiving beautifully composed actionable alerts might not be good news - it might mean that your monitoring is broken.

He also stated that in case of an outage the first thing that must be restored is the monitoring, before anything else! That’s probably very good advice that I can foresee being ignored by a panicked manager in case of a crisis…
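One common way to notice that silence means “broken” rather than “all good” is a heartbeat check: the monitoring system reports in regularly, and something else complains when it stops. Below is a minimal Python sketch of that idea. The function name, the five-minute threshold, and the timestamps are my own assumptions, not something shown in the talk.

```python
from datetime import datetime, timedelta, timezone


def monitoring_is_silent(
    last_heartbeat: datetime, now: datetime, max_silence: timedelta
) -> bool:
    """Return True when the monitoring has been quiet for longer than allowed."""
    return now - last_heartbeat > max_silence


# Made-up timestamps, for illustration only.
now = datetime(2019, 3, 21, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=2)
stale = now - timedelta(minutes=30)

print(monitoring_is_silent(recent, now, timedelta(minutes=5)))
print(monitoring_is_silent(stale, now, timedelta(minutes=5)))
```

The check itself has to run somewhere independent of the monitoring it watches, otherwise both go down together.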

I liked the first talks of the day - which were both part of the Site Reliability Engineering track - so much that I decided to change my schedule for the rest of the day. These three talks (one of which was a 15-minute one) were dropped:

So my third talk of the day was also by Jason Hand.

Read about that in part two of this blog post!