December 18, 2017

Day 18 - Awesome command-line fuzzy finding with fzf

By: Nick Stielau (@nstielau)

Edited By: Sascha Bates (@sascha_d)

It’s all about the CLIUX

Ok, CLIUX is not a word, but command-line interface user experience is definitely a worthy consideration, especially if you use the command line on a daily basis (or, hourly). Web apps and graphical user interfaces usually get the design love, but this holiday season, treat yourself to a great command line user experience.

Now that we’re thinking about CLIUX, we can roll up our sleeves and see how we can improve some of our everyday command line operations. This can be as simple as adding some colors to our shell prompts or as big as changing over to a new shell.

We’re going to focus on a few ways to make finding things easier.

Enter FZF

A lot of CLI is about finding stuff. Sometimes that is finding the right files to edit, or searching through git logs to review, or finding a process ID or user ID to use as a parameter for other commands (kill, chgrp, etc).

Fzf is a fuzzy finder, a CLI tool that is explicitly meant to bridge the technical need to find things with the user need to find them easily. It has been around for a few years, but this is the year you use it to make your CLIUX awesome.

FZF Logo

FZF adheres to the Unix Philosophy of doing one thing, and doing it well and valuing composability. These two characteristics make it likely we can find lots of ways to use fuzzy finding, and to integrate it into our daily operations.

Quick shoutout to junegunn for authoring fzf and Tschuy for introducing fzf to me.

Using FZF

Getting started

You can install FZF with package managers, brew install fzf for OSX and dnf install fzf for Fedora, or check out the full installation instructions for other options. Once you’ve got it installed, we’ll start fuzzy finding!

Step 1) Piping data into FZF

We’ll start off by looking at how we can pipe an input set to fzf and do some fuzzy finding. Fzf can fuzzy search any input from stdin. For starters, let’s pipe our dictionary through fzf,

cat /usr/share/dict/words | fzf

and start typing to narrow down the list to matching words: c l n j f i. Ok, so there is one word out of the 235,886 in the dictionary that contains those letters in that order. Clanjamfrie.
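To make "those letters in that order" concrete, here is a minimal sketch that approximates the match with plain grep. It is not how fzf works internally (fzf also scores and ranks candidates); the helper name `subseq_regex` and the three-word list are made up for illustration, and the helper assumes the query contains only plain letters.

```shell
# Hypothetical helper: turn a query such as "clnjfi" into the regex
# "c.*l.*n.*j.*f.*i", i.e. the letters in order with anything in between.
subseq_regex() {
  printf '%s' "$1" | sed 's/./&.*/g; s/\.\*$//'
}

# A tiny stand-in for /usr/share/dict/words.
words='clanjamfrie
california
conjecture'

# Case-insensitive match; only words containing the letters in order survive.
printf '%s\n' "$words" | grep -i -e "$(subseq_regex clnjfi)"
```

Of the three sample words, only clanjamfrie is printed.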

Searching for clanjamfrie

It’s a Scottish word that means spoken nonsense, as in “Anyone who doesn’t like fuzzy finding is just spouting clanjamfrie.” Who knew?

Step 2) Piping data out of FZF

Now we are getting somewhere. Selecting a dictionary word is pretty useful, but how can we make this really great? That’s right, piping our selected word to cowsay.

cat /usr/share/dict/words | fzf | cowsay
fzf and cowsay FTW

Step 3) Live preview

Sysadmins everywhere are on the edge of their seats. Cowsay, dictionaries, fuzzy finding. It’s almost too much. How can we take this to the next level?

Using the --preview flag, we can specify a program for a live preview of our fzf selection. In this case, we can get a preview of exactly how the cow will say it:

cat /usr/share/dict/words | fzf --preview "cowsay {}" | cowsay
Preview that moo

Step 4) Gitting More Pragmatic

Not that cowsay isn’t a real world use-case, but let’s get into something more pragmatic, like… git! This gem of an example is from the ample examples on the fzf wiki (although pared down a bit for consumability here). It shows how to search through git logs and examine a diff.

# fshow - git commit browser
fshow() {
  git log --graph --color=always \
      --format="%C(auto)%h%d %s %C(black)%C(bold)%cr" "$@" |
  fzf --ansi --bind "enter:execute:
            (grep -o '[a-f0-9]\{7\}' | head -1 |
            xargs -I % sh -c 'git show --color=always % | less -R') << 'FZF-EOF'
            {}
FZF-EOF"
}

The big new addition here is the --bind option, which can specify a program to execute upon key press or selection. The fshow function uses this functionality to view the git diff with less when enter is pressed.
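The hash-extraction step inside the binding can be tried on its own. The sketch below feeds it a made-up log line (the hash and commit subject are hypothetical) in place of the `{}` that fzf would substitute; the wrapper name `commit_hash` is also just for illustration.

```shell
# Hypothetical sample of one line as fzf would hand it to the binding via {}.
line='* 3f2a9c1 (HEAD -> master) Tidy up the README 2 days ago'

# Same pipeline as in fshow: print every run of seven hex characters
# (one per line), then keep only the first, which is the abbreviated hash.
commit_hash() {
  printf '%s\n' "$1" | grep -o '[a-f0-9]\{7\}' | head -1
}

commit_hash "$line"
```

This prints 3f2a9c1. The approach relies on the hash being the first seven-character hex run on the line, which holds for this log format because the hash comes before the subject.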

Here’s the fshow function running against the fzf codebase:

Finding a git commit

Step 5) Make viewing diffs easy

This is pretty cool, but we already know how to make it cooler. With --preview!

fshow() {
  git log --graph --color=always \
      --format="%C(auto)%h%d %s %C(black)%C(bold)%cr" "$@" |
  fzf --ansi --preview "echo {} | grep -o '[a-f0-9]\{7\}' | head -1 | xargs -I % sh -c 'git show --color=always %'" \
             --bind "enter:execute:
                (grep -o '[a-f0-9]\{7\}' | head -1 |
                xargs -I % sh -c 'git show --color=always % | less -R') << 'FZF-EOF'
                {}
FZF-EOF"
}
Previewing diffs

Wow! Does anyone else feel like we just implemented GitHub in like 6 lines?!?!

Step 6) Man Explorer

Or, whip up a man page explorer with search and live-preview. Once you have the format down, you can imagine more use-cases that boil down to fuzzy finding with a detailed preview that you can select to execute additional commands, optionally falling back to the beginning.

# This is ugly.  Refactoring left as exercise to reader...
function  mans(){
man -k . | fzf -n1,2 --preview "echo {} | cut -d' ' -f1 | sed 's# (#.#' | sed 's#)##' | xargs -I% man %" --bind "enter:execute: (echo {} | cut -d' ' -f1 | sed 's# (#.#' | sed 's#)##' | xargs -I% man % | less -R)"
}
Man Page Explorer
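As a small start on that refactoring, the parsing could live in one helper. This is a sketch under assumptions: the output format of `man -k` differs between systems, the two sample lines are made up, and `apropos_to_page` is a hypothetical name.

```shell
# Hypothetical helper: turn one line of `man -k` output into "SECTION NAME",
# the argument order `man` expects (for example: man 1 ls).
apropos_to_page() {
  printf '%s\n' "$1" | sed -E 's/^([^ (,]+)[^(]*\(([^)]+)\).*/\2 \1/'
}

apropos_to_page 'ls (1)               - list directory contents'
apropos_to_page 'printf(3)            - formatted output conversion'
```

The two calls print "1 ls" and "3 printf", so the same helper copes with and without a space before the section number.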

Why I can’t live without FZF

Ok, I guess I could live. But my command line usage would be a little sadder. For example, I use these Kubernetes config helpers every day. In addition to saving some time, I get a little bit more joy out of the fzf implementations. I love the command line, but that doesn’t mean I don’t want a great user experience. Heck, I deserve one.

# short alias for picking a Kube config
# Find cluster definitions in ~/.kube and save one as variable.
c () {
  export KUBECONFIG=$(find ~/.kube -type f -name '*'"$1"'*' -exec grep -q "clusters:" {} \; -print | fzf --select-1)
}

# helper for setting a namespace
# List namespaces, preview the pods within, and save as variable
ns () {
    namespaces=$(kubectl get ns -o=custom-columns=:.metadata.name)
    # Quote the variable so each namespace stays on its own line for fzf
    export NS=$(echo "$namespaces" | fzf --select-1 --preview "kubectl --namespace {} get pods")
    echo "Set namespace to $NS"
}

# short alias that uses chosen namespace
k () {
    kubectl --namespace="${NS:-default}" "$@"
}

Choosing a namespace:

Kubernetes Namespace Chooser
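The k helper leans on the shell's `${NS:-default}` expansion, which falls back to "default" when no namespace has been picked yet. Here is a dry-run sketch that prints the command instead of calling kubectl, so it can be tried without a cluster; `k_dry_run` and the namespace "staging" are hypothetical.

```shell
# Dry-run variant of k(): prints the kubectl command instead of running it.
k_dry_run() {
  echo kubectl --namespace="${NS:-default}" "$@"
}

unset NS
k_dry_run get pods          # NS is unset, so "default" is used

NS=staging                  # hypothetical namespace name
k_dry_run get pods          # now the chosen namespace is used
```

The first call prints "kubectl --namespace=default get pods" and the second prints "kubectl --namespace=staging get pods".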

Conclusion

If you want to spruce up your command line user experience, dig in with fzf and put the power of fuzzy finding to use. Junegunn’s fzf is a simple, composable tool that can make your command line more efficient and usable for 2018.

References, Details and Inspiration

December 17, 2017

Day 17 - Don’t Fall for the Hybrid Cloud Trap

By: Andrew Shieh (@shandrew)
Edited By: Alfonso Cabrera (@alfonso__c)

Your December may be full of thoughts and planning around the future of your computing infrastructure. If so, try to take a vacation! Still thinking about your infrastructure? If you’re still operating your own hardware, you’ll need to evaluate how the cloud fits into your technology strategy. Understanding your choices requires a quick look at the recent history of the systems readily available.

A Short and Incomplete History of Cloud Computing

1995: This web thing is cool, and I can pay a web host to handle my Under-Construction.gif! But I need more on the backend, and don’t want to share resources with 254 other users. I’m going to keep running my own.

2000: I survived Y2K! This VMware product looks neat, but what in the world would I do with it? Running multiple “servers” on my servers seems like more work for me.

2006: So Amazon, the book retailer, is trying to sell me storage, by the byte? Seems odd, but I like the idea better than my current storage strategy of buying new NFS servers every time we’re nearing capacity. It typically ends with a mess of different server types and disk configurations.

2007: Hm, Amazon’s offering virtual servers too. They seem pretty weak compared to what we are running ourselves, and are a bit pricey, and unreliable. Maybe we can try them out for some temporary usage, or try using them for peak loads, but our data center’s servers are faster, cheaper and more reliable. We know how to run them.

2010: That AWS ball has really been rolling! They’ve lowered their prices, added load balancing, autoscaling, and private networking. It’s beginning to look more like my nice, stable data center. We could try them out, Netflix seems to like it, but we’re quite happy with our blazingly fast SSDs, and AWS has nothing like that.

2012: Google and Microsoft are getting into this game too? Maybe we should take this more seriously. And OpenStack makes it look like I can run the same kind of things on my servers. Wait, AWS has SSDs now? I just spent a month rotating SSDs out of all of our RAIDs to fix their #%! firmware bugs. Hmmm.

2014: Virtual GPUs, 100 more things

2016: 10x of 2014

The pace of development of cloud computing, especially in AWS, resembles that of a paperclip manufacturing AI. The concept of having your computing infrastructure completely hosted by a third party has gone from controversial to “boring technology”. Even the virtualization technology that we rarely surface has improved at an astonishing rate. All of these advancements feed back into the development machine, making the technologies cheaper and better.

Did I Miss the Boat?

You saw all this cool stuff go by, but you’re still running your own data center. It’s worked for you, and you don’t have the staff and the time to migrate to something else. But your management is pushing you to have better peak load capacity, lower costs, and greater agility. You finally decide to reply to one of the hundreds of emails you have in your Vendor inbox, on Hybrid Clouds for Easy Cloud Migration. They promise you that you can keep running in your data center while you move to the cloud at your own pace. Maybe you can even keep the same infrastructure running in both.

In some short time window in our cloud past, this may have made sense. Using the cloud for peak loads made sense; using the cloud for things where the servers were disposable made sense. But while you were busy running your own metal, the cloud grew cheaper and the uptimes got way better than yours. The benefits of moving everything to the cloud grew, far past the break even point. Every day you’re not there is a larger opportunity lost.

A consultant/vendor-supported hybrid cloud solution doesn’t get you on the boat fast enough. They’re looking for your dollars every month in perpetuity, while your goal should be around migrating to 100% cloud. Supporting two technologies and infrastructures adds complexity to your architecture. It also adds cost and stretches your staff thin. The trap happens when supporting the hybrid model sucks your resources to the point where you cannot complete your migration to the cloud–the last bits are the hardest.

To avoid this trap, you’ll need to carefully plan your management of your datacenter and cloud systems. Developing their configurations in parallel rather than trying to plaster an awkward unifying layer on top will pay off in the long run. Follow best practice recommendations of your cloud provider. Join the communities of people using the same provider; your systems are inevitably much more similar, and working together is even more valuable.

Make 100%-cloud your priority, and have a quiet, pager-free SysAdvent.

December 16, 2017

Day 16 - Inspec gives insight

By: Jamie Andrews (@vorsprungbike)

Edited By: Ben Cotton (@FunnelFiasco)

What is Inspec?

How often do you want to scan your servers, all of them, to check that a library has the correct version or that the right disk clone is mounted?

Configuration control systems like Puppet, Chef and Ansible have become de rigueur in running even modest sized groups of servers. Configuration control offers centralised control over a state which is set by policy.

However, sometimes something happens to one or more servers which challenges this happy state of affairs.

There might be a change requirement that doesn't fit in with the way that the configuration system works. You might decide that part of the system should be managed in a different way. Someone in your team might "temporarily" disable the configuration agent and then not turn it back on again. Configuration drift happens.

When it does happen, the resulting problems are inevitably hard to trace and puzzling to understand.

This might be a job for a quick shell script and some ssh magic. But often I've done this and then realised that it's difficult to get just right. Inspec is a framework for systematically carrying out these types of tests.

Inspec can be very useful in these circumstances. It works like a configuration management tool like Puppet or Chef in that it holds a model of what the correct configuration should be. However, it does not modify the configuration. Instead it tests the targeted systems for compliance to the correct configuration.

I have found this to be useful in the situation where there are many systems, just a few of which have a problem. Inspec can pinpoint the problem systems, given a configuration profile.

And of course Inspec can be used to proactively check servers to detect problems before they occur.

Inspec is based on an earlier system called "serverspec" and it uses the ruby classes from the rspec package.

Although it is promoted by Chef as part of their wider product offering, it works just fine standalone and is fully open source.

What I'm covering in the article

Below, I'll look at installing Inspec, making a simple project "profile", how it works in action and installing a third party security checking profile.

Setting up and installing

The easiest way of installing it is to use a package. Packages for Red Hat and variants, Ubuntu, Suse, MacOSX and MS Windows are available here: https://downloads.chef.io/inspec

Inspec has a dependency on Ruby. The packages include a bundled version of Ruby that avoids compatibility problems. If you already have Ruby installed and want to use it then "gem install inspec" is available. See the GitHub repo https://github.com/chef/inspec for more details.

To check Inspec is installed, try this command:

inspec version

Which will come back with the currently installed version.

Inspec does not have to be installed on all the target nodes. I have it installed on one admin host with ssh access to everything else. This allows any profile rule sets to be tested on anything it can ssh to. No agent, Ruby install or anything else is required on the targets.

Make your own profile

Inspec works by using a profile directory with a set of control files that contain tests. The "init profile" CLI command is used to make a new profile.

To see a generic blank profile do

inspec init profile example

The output from this command is something like

$ inspec init profile example
WARN: Unresolved specs during Gem::Specification.reset:
      rake (>= 0)
WARN: Clearing out unresolved specs.
Please report a bug if this causes problems.
Create new profile at /home/jan/temp/example
 * Create directory libraries
 * Create file inspec.yml
 * Create directory controls
 * Create file controls/example.rb
 * Create file README.md

It has made a set of files. The most interesting is "example/controls/example.rb"

This is a very simple test that checks if /tmp exists. Take a look at it:

# encoding: utf-8
# copyright: 2017, The Authors

title 'sample section'

# you can also use plain tests
describe file('/tmp') do
  it { should be_directory }
end

# you add controls here
control 'tmp-1.0' do                  # A unique ID for this control
  impact 0.7                          # The criticality, if this control fails.
  title 'Create /tmp directory'       # A human-readable title
  desc 'An optional description...'
  describe file('/tmp') do            # The actual test
    it { should be_directory }
  end
end

The tests can be declared as "plain tests" or "controls". Being a control adds some metadata which makes it easier to track the test within a set_con

The actual test assertions "it { should be_directory }" follow the rspec syntax. The tests operate on a resource type, in this case "file". There are many useful built-in test resource types, including

  • apache_conf
  • crontab
  • docker
  • etc_fstab
  • http

And a lot more, see https://www.inspec.io/docs/reference/resources/

A more complex example

Here's a real test I wrote a couple of weeks ago to deal with a DNS configuration drift.

The old DNS servers had been retired but we noticed that some servers still mentioned servers on the old network.

# encoding: utf-8

title 'DNS'

# some crude way to build a list of network interfaces
eth_files = ['/etc/sysconfig/network-scripts/ifcfg-eth0']
eth_files << '/etc/sysconfig/network-scripts/ifcfg-eth1' if file('/etc/sysconfig/network-scripts/ifcfg-eth1').exist?
eth_files << '/etc/sysconfig/network-scripts/ifcfg-ens32' if file('/etc/sysconfig/network-scripts/ifcfg-ens32').exist?

control 'resolv' do                        # 
  impact 0.7                               # 
  title 'check old dns is not present'     # 
  desc 'old dns'
  describe file('/etc/resolv.conf') do     # The actual test
    its ('content')  { should_not match /193/ }
  end

  eth_files.each do |ef|
    describe file(ef) do
      its ('content') { should_not match /^DOMAIN=193/ }
      its ('content') { should_not match /^DNS[123]=193/ }
    end
  end

end

I won't explain exactly how it works, you can see that there are regexps in there and that a ruby "each do" construct is used.
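For a feel of what the anchored patterns catch, the same regexp can be tried with grep on a scratch file. This is only a sketch: the file contents and addresses are invented, and `old_dns_lines` is a hypothetical helper name. Note that the resolv.conf check uses the unanchored /193/, which matches anywhere in the file, while the interface-file checks only match at the start of a line.

```shell
# Hypothetical interface file with one DNS entry on the old 193.x network.
sample=$(mktemp)
cat > "$sample" <<'EOF'
DEVICE=eth0
DNS1=193.0.2.10
DNS2=10.0.0.2
IPADDR=10.0.193.7
EOF

# Same pattern as the control: only lines that start with DNS1, DNS2 or DNS3
# followed by =193 count as a finding.
old_dns_lines() {
  grep -E '^DNS[123]=193' "$1"
}

old_dns_lines "$sample"
rm -f "$sample"
```

Only the DNS1 line is printed; the IPADDR line contains 193 but does not match because of the ^ anchor.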

To run the tests do

inspec exec tc-config -t ssh://myuser@myhostname

screen shot of it running

As mentioned above, Inspec does not correct these problems. It is great at one job: checking compliance. Once the problem is found then you will have to devise a good method for fixing it.

When looking at large numbers of live servers, I usually run it in a shell and redirect the output to a file. Once the time-consuming checking is done, I look at the file. The colourizing makes it easy to spot the non-compliant areas.
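A rough sketch of that workflow, assuming a plain text file with one user@host entry per line. The names `run_profile`, `INSPEC_CMD`, hosts.txt and reports are illustrative rather than part of Inspec; the only Inspec feature used is `inspec exec` with an ssh:// target.

```shell
# Sketch: run one profile against every host in a list and keep one report
# file per host. INSPEC_CMD lets you swap in another command for a dry run.
run_profile() {
  profile=$1 hosts_file=$2 out_dir=$3
  mkdir -p "$out_dir"
  while IFS= read -r host; do
    [ -n "$host" ] || continue                 # skip blank lines
    # `inspec exec PROFILE -t ssh://HOST` runs the profile on a remote target.
    ${INSPEC_CMD:-inspec} exec "$profile" -t "ssh://$host" \
      > "$out_dir/$host.txt" 2>&1 < /dev/null
  done < "$hosts_file"
}

# Example call (commented out because it needs inspec and reachable hosts):
# run_profile tc-config hosts.txt reports
```

Each report can then be paged with something like `less -R reports/myuser@myhostname.txt`, which keeps any colour codes readable.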

In the above example, I found 12 servers non-compliant out of 146. Some problems were found where Puppet conflicted with Red Hat system policy. I devised a simple, non-idempotent bash script and applied it to the affected servers only. This was quicker and gave a more certain result than running it on all the servers. After the correction, I reran the Inspec profile to see that everything was ok.

Check to check the check

Once you start trying to use your own tests there is always scope for typos or syntax error or other sorts of mayhem. Inspec tries to help with a static checker.

inspec check example

Comes back with a report of how many controls there are plus if everything is valid.

This feature is a great idea, especially for those of us that are only using this tool occasionally.

Ready to use sets of tests

As the Inspec system is supported and promoted by Chef there are a set of profiles ready made that perform various types of compliance. These can be downloaded and used, see https://supermarket.chef.io/tools?type=compliance_profile

Installing and using the CIS DIL

One really useful profile is the CIS Distribution Independent Linux Benchmark https://github.com/dev-sec/cis-dil-benchmark

To try it, clone that github repo and then

 inspec check cis-dil-benchmark

It has 88 controls, many of which check multiple resources. On our LAN running it against a host took over 3 minutes.

The report it generates is interesting reading.

We will be using this profile for testing images generated with Packer and Puppet. The issues it reports will act as feedback for security improvements to our configuration scripts in Puppet and Packer.

Further features

I am just scratching the surface with the many features that Inspec offers.
Please do look at http://inspec.io for a fuller run down!

Thanks

Thanks to Chef for supporting this excellent software and keeping it open source.

Thanks to my employers, Institute of Physics Publishing for enabling me to try out cool stuff like this as part of my job.

December 15, 2017

Day 15 - A DevOps Christmas Carol

By: Emily Freeman (@editingemily)
Edited By: Corey Quinn (@quinnypig)

The DevOps Christmas Carol is a bastardized, satirical version of Charles Dickens’ iconic work, A Christmas Carol.

It’s Christmas Eve and we find Scrooge, a caricature of a San Francisco-based, VC-backed tech startup CEO haunted by Peter Drucker’s ghost — who warns him of the visits of three ghosts: the Ghost of DevOps Past, the Ghost of DevOps Present and the Ghost of DevOps Yet to Come. (Victorians were seriously wordy.)

Scrooge’s company, Humbug.ly, has adopted DevOps, but their Tiny Tim app is still in danger of falling over. And Scrooge is still complaining about that AWS bill.

I want you to laugh at the absurdity of our industry, remember the failures of yesterday, learn the lessons of today and embrace the challenges of tomorrow.

Above all else, Merry Christmas, Chag Sameach and Happy New Year. May everyone’s 2018 be better than the dumpster fire that was this year.

STAVE I

Old Peter Drucker was as dead as a doornail. (I don’t exactly know what’s dead about a doornail, but we’ll go with it.)

Drucker had been dead for many years. Every DevOpsDays deck included a Drucker quote. And many shots had been consumed playing the Drunker Drucker drinking game.

Scrooge was your average SF CEO. His grandfather was business partners with Drucker and Scrooge continues to worship the Drucker deity.

It was Christmas Eve and Scrooge sat in his glass-enclosed office drinking artisanal, small-batch coffee. His cool disposition was warmed thinking about the yacht he would buy when his startup, Humbug.ly, IPO’d.

Sure, they had unlimited vacation. But no one ever took it — even on Christmas Eve. His employees loved working that much.

He watched the developers and operations folks gather for standup in the “Innovate” conference room. It was great to see the teams working together. After all, they had spent $180,000 on a consultant to “do the DevOps.”

“A merry Christmas, uncle!” His sister had forced Scrooge to hire his cheery nephew as an intern.

“Bah!” said Scrooge, “Humbug!”

“Christmas a humbug, uncle!” said Scrooge’s nephew. “You don’t mean that, I am sure?”

“I do,” said Scrooge. “Merry Christmas! What reason have you to be merry? I pay you $19 an hour and you live in a closet with 4 other men in Oakland.”

The receptionist quietly tapped the glass door. “Two men from Homeless Helpers of San Francisco are here to see you.” Scrooge waved them in.

“At this festive season of the year, Mr. Scrooge,” said the gentleman, “it is more than usually desirable that we should make some slight provision for the poor and destitute, who suffer greatly at the present time.”

“I thought we were just moving them to San Jose,” replied Scrooge.

Seeing clearly it would be useless to pursue their point, the two men withdrew, mumbling about hate in the Trump era.

The day drew to a close. Scrooge dismounted from his Aeron chair and put on his sweatshirt and flat-brimmed hat.

“You’ll want all day tomorrow, I suppose?” said Scrooge to his employees, whose standing desks were packed as tightly as an Amazon box filled with Cyber Monday regret. “I suppose you must have the whole day. But if the site goes down, I expect all of you to jump on Slack and observe helplessly as Samantha restarts the servers.”

Scrooge took his melancholy dinner at his usual farm-to-table tavern. Walking home on the busy sidewalk, Scrooge approached his doorman only to see Drucker, staring at him. His demeanor vacant, his body translucent. Scrooge was shook. He brushed past the ghostly figure and hurried toward the elevator.

Satisfied he had too many glasses of wine with dinner, Scrooge shut the door and settled in for the night. Suddenly, Siri, Alexa, Cortana and Google joined together to make a horrific AI cacophony.

This was followed by a clanking noise, as if someone were dragging a heavy chain over the hardwood floors. Scrooge remembered to have heard that ghosts in haunted houses were described as dragging chains.

The bedroom door flew open with a booming sound and then he heard the noise much louder, coming straight towards his door.

His color changed when he saw the same face. The very same. Drucker, in his suit, drew a chain clasped about his middle. It was made of servers, CPUs, and endless dongles.

Scrooge fell upon his knees, and clasped his hands before his face. “Mercy!” he said. “Dreadful apparition, why do you trouble me?”

“I wear the chain I forged in life. Oh! Captive, bound, and double-ironed,” cried the phantom.

“But you were always a good man of business and operations, Peter,” faltered Scrooge, who now began to apply this to himself.

“Business!” cried the Ghost, wringing his hands again. “Mankind was my business. Empathy, compassion and shared documentation were all my business. Operations was but a drop of water in the ocean of my business! You will be haunted,” resumed the Ghost, “by three spirits.”

“I—I think I’d rather not,” said Scrooge.

“Without their visits,” said the Ghost, “you cannot hope to shun the path of waterfall development, silos and your startup’s slow descent into obscurity. Expect the first tomorrow.”

With that, Drucker’s ghost vanished.

STAVE II

It was dark when Scrooge awoke from his disturbed slumber. So dark he could barely distinguish transparent window from opaque wall. The chimes of a neighboring church struck the hour and a flash lit up the room in an instant. Scrooge stared face-to-face with an unearthly visitor.

It was a strange figure—small like a child, but old. Its hair was white with age but its face had not a wrinkle.

“Who, and what are you?” Scrooge demanded.

“I am the Ghost of DevOps Past.”

“Long past?” inquired Scrooge, observant of its dwarfish nature.

“DevOps is only like 10 years old. Do you even read Hacker News?” He paused. “Rise! And walk with me!”

The Ghost took his hand and together they passed through the wall, and stood upon an open convention floor.

“Good heaven! I went to this conference!” said Scrooge.

“The open space track is not quite deserted,” said the Ghost. “A solitary man, neglected by his friends, is left there still.”

In a corner of the conference, they found a long, bare, melancholy room. In a chair, a lonely man was reading near the feeble light of his phone. Scrooge wept to see poor Andrew Clay Shafer in a room alone.

“Poor man!” Scrooge cried. “I wish,” Scrooge muttered, putting his hand in his pocket, “but it’s too late now.”

“What is the matter?” asked the Spirit.

“Nothing,” said Scrooge. “Nothing. There was a developer who asked about ops yesterday. I should like to have given him something. That’s all.”

The Ghost smiled thoughtfully, and waved its hand: saying as it did so, “Let us see another conference!”

The room became a little darker. The panels shrunk and the windows cracked. They were now in the busy thoroughfares of a city, where shadowy passengers passed along the narrow streets and Medieval architecture.

The Ghost stopped at a certain door and ushered Scrooge in. “Why it’s old Patrick Dubois! Bless his heart.”

“Another round!” Dubois announced.

“This is the first DevOpsDays afterparty in Ghent,” explained the Ghost.

Away they all went, twenty couple at once toward the bar. Round and round in various stages of awkward grouping.

“Belgians sure know how to party,” said Scrooge.

“Sure. But it’s a tech conference so it’s still pretty awkward,” remarked the Ghost. “A small matter,” it continued, “to make these silly folks so full of gratitude. It only takes a t-shirt and a beer.”

“Small!” echoed Scrooge.

“Why! Is it not? He has spent but a few dollars of your mortal money. Is that so much he deserves this praise?”

“It isn’t that, Spirit. He has the power to render us happy or unhappy. To make our service light or burdensome. Say that his power lies in words and looks, in things so slight and insignificant that it is impossible to add and count ’em up. The happiness he gives, is quite as great as if it cost a fortune.”

He felt the Spirit’s glance, and stopped.

“What is the matter?” asked the Ghost.

“Nothing particular,” said Scrooge.

“Something, I think?” the Ghost insisted.

“No,” said Scrooge, “No, I should like to be able to say a word or two to my employees just now. That’s all.”

“My time grows short,” observed the Spirit. “Quick!”

It produced an immediate effect. Scrooge saw himself. He was not alone but with two former employees. The tension in the room was overwhelming.

“The code didn’t change,” explained the developer. “There’s no way we caused this out—”

“That’s bullshit!” interjected the SRE. “There was a deploy 10 minutes before the site went down. Of course that’s the issue here. These developers push out crappy code.”

“At least we can code,” replied the developer, cruelly.

“No more,” cried Scrooge. “No more. I don’t wish to see it.”

But the relentless Ghost pinioned him in both his arms and forced him to observe.

“Spirit!” said Scrooge in a broken voice, “remove me from this place. Remove me. I cannot bear it!”

He turned upon the Ghost, seeing that it looked upon him with a face in which, in some strange way, there were fragments of all the faces it had shown him. “Leave me! Take me back. Haunt me no longer!” Scrooge wrestled with the spirit with all his force. Light flooded the ground and he found himself exhausted. Overcome with drowsiness, Scrooge fell into a deep sleep.

STAVE III

Awaking in the middle of a prodigiously tough snore and sitting up, Scrooge was surprised to be alone. Now, being prepared for almost anything, he shuffled in his slippers to the bedroom door. The moment Scrooge’s hand was on the lock, a strange voice called him by his name and bade him enter. He obeyed.

It was his own room, there was no doubt. But it had undergone a surprising transformation. The walls and ceiling were hung with living green and bright gleaming berries glistened.

“Come in!” exclaimed the Ghost. “Come in and know me better, man!”

Scrooge entered timidly and hung his head. The Spirit’s eyes were clear and kind but he didn’t like to meet them.

“I am the Ghost of DevOps Present,” said the Spirit. “Look upon me!”

Scrooge reverently did so. It was clothed in one simple green robe, bordered with white fur. This garment hung so loosely on the figure that its capacious breast was bare.

“You have never seen the like of me before!” exclaimed the Ghost.

“Never,” Scrooge answered. “But you may want to cover up a bit. This is breaking some kind of code of conduct and men are getting in trouble for this kind of thing these days.”

“Touch my robe!”

“OK, this is getting awkward. And inappropriate. Seriously, you can’t expose yourself like this. House of Cards was canceled because of people like you. It’s not cool, man.”

“Touch my robe!” the Ghost bellowed.

Scrooge did as he was told, and held it fast.

The room vanished instantly. They found themselves in the conference room at Humbug.ly. The Ghost sprinkled incense from his torch on the heads of the employees sat around the table. It was a very uncommon kind of torch, for once or twice when there were angry words between some men, he shed a few drops of water on them from it, and their good humour was restored directly.

“Is there a peculiar flavor in what you sprinkle from your torch?” asked Scrooge.

“There is. My own.”

“OK, buddy, we gotta work on the subtle sexual harassment vibe you’re working with.”

Scrooge’s employees at the table began to argue about CI/CD, pipelines and testing for the new Tiny Tim app.

“We need to be deploying every ten minutes. At a minimum. That’s what Netflix does,” said Steve, confidently.

“We use Travis CI. What if we had commits immediately deploy to production?” asked Tony.

“Um, that’s a terrible idea. You want the developers’ tests to be the only line of defense against site outages? I don’t want to be on call during that disaster.”

“There’s no reason to get an attitude, Steve.”

“Well, you’re suggesting that developers should be trusted to deploy their own code.”

“That’s exactly what I’m saying.”

“That’ll never work. QA and security need to review everything before it’s pushed out.”

“Yea, but there’s this whole concept of moving things to the left. Where ops, security, QA are all involved in feature planning and the developer architects the code with their concerns in mind. That way, we don’t have Amy spending a full week on a feature only to have security kick it back.”

“I think that’s exactly how it should work. Developers need to code better.”

“‘Coding better’ is not actionable or kind. And that kind of gatekeeping process creates silos and animosity. People will start to work around each other.”

“We’ll just add more process to prevent that.”

“Never underestimate someone in tech’s ability to use passive-aggressiveness in the workplace. Listen. Scrooge wants Tiny Tim to be reliable, agile, testable, maintainable and secure.”

“Well, that’s impossible.”

“Pfff,” remarked Scrooge. “You’re fired.”

The Ghost sped on. It was a great surprise to Scrooge, while listening to the moaning of the wind, and thinking what a solemn thing it was to move on through the lonely darkness over an unknown abyss, whose depths were secrets as profound as Death. This was the ever-present existential crisis of life. It was a great surprise to Scrooge, while thus engaged, to hear a hearty laugh.

“Ha, ha! Ha, ha, ha, ha!” laughed Scrooge’s nephew.

“He said that Christmas was a humbug, as I live!” cried Scrooge’s nephew. “He believed it too! I am sorry for him. I couldn’t be angry with him if I tried. Who suffers by his ill whims! Himself, always. He takes it into his head to dislike us. No one at Humbug.ly likes him. The product owner for Tiny Tim is about to quit and Scrooge has no idea!”

Scrooge was taken aback. He built a great team. They loved him. Adored him, even. Or so he thought. It’s true he hadn’t taken time to talk to any of them in several months, but everything was going so well. They were only two months from the launch of Tiny Tim.

His nephew continued, “We’re all underpaid and overworked. Scrooge is constantly moving the goalpost. He expects us to be perfect.”

The Ghost grew older, clearly older.

“Are spirits’ lives so short?” asked Scrooge.

“My life upon this globe is very brief,” replied the Ghost. “It ends tonight. Hark! The time is drawing near.”

The Ghost parted the folds of its robe. “Look here. Look, look, down here!”

“You. Have. To. Stop. With. This.” sighed Scrooge.

From its robe it brought two children. Scrooge started back, appalled at the creatures. “Spirit! Are they yours?”

“They are Op’s,” said the Spirit, looking down upon them. “And they cling to me. This boy is Serverless. This girl is Lambda.”

The bell struck twelve. Scrooge looked about him for the Ghost and saw it not. Lifting up his eyes, he beheld a solemn Phantom, draped and hooded, coming, like a mist along the ground, towards him.

STAVE IV

The Phantom slowly, gravely, silently approached. When it came near him, Scrooge bent down upon his knee; for in the very air through which this Spirit moved it seemed to scatter gloom and mystery.

It was shrouded in a deep black garment, which concealed its head, its face, its form, and left nothing of it visible save one outstretched hand.

“I am in the presence of the Ghost of DevOps Yet To Come?” said Scrooge.

The Spirit answered not, but pointed onward with its hand.

“Lead on!” said Scrooge. “The night is waning fast, and it is precious time to me, I know. Lead on, Spirit!”

The Phantom moved away as it had come towards him. Scrooge followed in the shadow of its dress, which bore him up, he thought, and carried him along.

They scarcely seemed to enter the city. For the city rather seemed to spring up about them and encompass them of its own act. The Spirit stopped beside one little knot of business men. Observing that the hand was pointed to them, Scrooge advanced to listen to their talk.

“No,” said a great fat man with a monstrous chin, “I don’t know much about it, either way. I only know it’s dead.”

“When did it happen?” inquired another.

“Yesterday, I believe.”

“How much of a down round was it? Just a stop gap?”

“No. Investors have lost faith. Scrooge sold this Tiny Tim app hard. Bet all of Humbug.ly on disrupting the DevSecDataTestOps space. He could only raise half of what they did in Series A.”

“It’s over,” remarked another.

“Oh yea, he’s done.”

This was received with a general laugh.

The Phantom glided on and stopped once again in the office of Humbug.ly. Its finger pointed to three employees fighting over who could take home the Yama cold brew tower, one nearly toppling it in the process. There were movers carrying standing desks out and employees haggling over their desks and chairs.

Scrooge watched as his office — his company — was systematically dismantled, broken down and taken away piece-by-piece. His own office was nearly empty save for a single accounting box — a solemn reminder of Scrooge’s priorities in his work.

“Spectre,” said Scrooge, “something informs me that our parting moment is at hand. I know it, but I know not how.”

The Ghost of DevOps Yet To Come conveyed him, as before. The Spirit did not stay for anything, but went straight on, as to the end just now desired, until besought by Scrooge to tarry for a moment.

The Spirit stopped; the hand was pointed elsewhere.

A pile of old computers lay before him. The Spirit stood among the aluminum graves, and pointed down to one. He advanced toward it trembling. The Phantom was exactly as it had been, but he dreaded that he saw new meaning in its solemn shape.

Scrooge crept towards it, and following the finger, read upon the screen, HUMBUG.LY — THIS WEBPAGE PARKED FREE, COURTESY OF GODADDY.COM.

“No, Spirit! Oh no, no!”

The finger was still there.

“Spirit!” he cried, tight clutching at its robe, “hear me! I am not the man I was. I will not be the man I have been but for this. I will honor DevOps in my heart, and try to keep it all the year. I will live in the Past, the Present and the Future. The Spirits of all three shall strive within me. I will not shut out the lessons they teach.”

Holding up his hands in a last prayer to have his fate reversed, he saw an alteration in the Phantom’s hood and dress. It shrunk, collapsed and dwindled down into a bedpost.

STAVE V

Yes! The bedpost was his own. The bed was his own, the room was his own.

“Oh Peter Drucker! Heaven, and DevOps be praised for this! I don’t know what to do!” cried Scrooge, laughing and crying in the same breath. Really, for a man who had been out of practice for so many years, it was a splendid laugh, a most illustrious laugh.

Scrooge hopped on Amazon with haste. He bought The Phoenix Project and The DevOps Handbook and had them delivered by drone within the hour.

“I’ll get a copy for every one of my employees!” exclaimed Scrooge.

He hopped into his Tesla and drove to his nephew’s apartment. Greeted by one of the 5 roommates, he asked to see his nephew.

“Fred!” said Scrooge.

“Why bless my soul!” cried Fred, “who’s that?”

“It’s I. Your uncle Scrooge. I have come to dinner. Will you let me in, Fred?”

The odd group had a twenty-something Christmas dinner and played Cards Against Humanity. Wonderful party, wonderful games, wonderful happiness!

But he was early at the office the next morning. If he could only be there first, and catch his employees coming in late!

The last employee stumbled in. “Hello!” growled Scrooge, in his accustomed voice, as near as he could feign it. “What do you mean by coming here at this time of day?”

“I am very sorry, sir. I am behind my time.”

“You are?” repeated Scrooge. “Yes. I think you are. Step this way, sir, if you please.”

“It’s only once a year, sir,” pleaded Bob. “It should not be repeated. Besides, we’re supposed to have flexible work hours. We have a ping-pong table for God’s sake!”

“Now, I’ll tell you what, my friend,” said Scrooge, “I am not going to stand this sort of thing any longer. And therefore,” he continued, leaping from his Aeron, “I am about to raise your salary! All of your salaries!”

Bob trembled. “Does… does this mean we’re doing that open salary thing?”

“No, don’t push it, Bob. That’s for hippies,” replied Scrooge. “A merry Christmas!”

Scrooge was better than his word. He did it all and infinitely more. The Tiny Tim app was finished on time, Humbug.ly adopted a DevOps culture, developers stopped being assholes and ops folks got more sleep.

Scrooge had no further experience with Spirits and it was always said of him that he knew how to keep DevOps well, if any man alive possessed the knowledge. May that truly be said of all of us! And so, Ops bless us, everyone!

December 14, 2017

Day 14 - Pets vs. Cattle Prods: The Silence of the Lambdas

By: Corey Quinn (@quinnypig)
Edited By: Scott Murphy (@ovsage)

“Mary had a little Lambda
S3 its source of truth
And every time that Lambda ran
Her bill went through the roof.”

Lambda is Amazon’s implementation of a concept more broadly known as “Functions as a Service,” or occasionally “Serverless.” The premise behind these technologies is to abstract away all of the infrastructure-like bits around your code, leaving the code itself the only thing you have to worry about. You provide code, Amazon handles the rest. If you’re a sysadmin, you might well see this as the thin end of a wedge that’s coming for your job. Fortunately, we have time; Lambda’s a glimpse into the future of computing in some ways, but it’s still fairly limited.

Today, the constraints around Lambda are somewhat severe.

  • You’re restricted to writing code in a relatively small selection of languages– there’s official support for Python, Node, .Net, Java, and (very soon) Go. However, you can shoehorn in shell scripts, PHP, Ruby, and others. More on this in a bit.
  • Amazon has solved the Halting Problem handily– after a certain number of seconds (hard capped at 300) your function will terminate.
  • Concurrency is tricky: it’s as easy to have one Lambda running at a time as it is one thousand. If they each connect to a database, it’s about to have a very bad day. (Lambda just introduced per-function concurrency, which smooths this somewhat.)
  • Workflows around building and deploying Lambdas are left as an exercise for the reader. This is how Amazon tells developers to go screw themselves without seeming rude about it.
  • At scale, the economics of Lambda are roughly 5x the cost of equivalent compute in EC2. That said, for jobs that only run intermittently, or are highly burstable, the economics are terrific. Lambdas are billed in Gigabyte-Seconds (of RAM).
  • Compute and IO scale linearly with the amount of RAM allocated to a function. Exactly what level maps to what is unpublished, and may change without notice.
  • Lambda functions run in containers. Those containers may be reused (“warm starts”) and be able to reuse things like database connections, or have to be spun up from scratch (“cold starts”). It’s a grand mystery, one your code will have to take into account.
  • There is a finite list of things that can trigger Lambda functions. Fortunately, cron-style schedules are now one of them.
  • The Lambda runs within an unprivileged user account inside of a container. The only place inside of this container where you can write data is /tmp, and it’s limited to 500MB.
  • Your function must fit into a zip file that’s 50MB or smaller; decompressed, it must fit within 250MB– including dependencies.
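
The warm-start versus cold-start point is usually handled by caching expensive resources at module level, so that a reused container skips the setup and a fresh one pays for it once. Here is a minimal sketch of that pattern. The `connect_to_database` helper is hypothetical and stands in for whatever client library you actually use.

```python
import time

# Module-level state survives between invocations only when the container is reused.
_connection = None


def connect_to_database():
    """Hypothetical stand-in for a real (and slow) database connection call."""
    return {"opened_at": time.time()}


def get_connection():
    """Return the cached connection, creating it on a cold start."""
    global _connection
    if _connection is None:
        _connection = connect_to_database()
    return _connection


def lambda_handler(event, context):
    conn = get_connection()
    # ... do the real work with conn here ...
    return {"connection_opened_at": conn["opened_at"]}
```

Real code would also have to cope with a cached connection that died while the container sat idle, which this sketch does not attempt.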

Let’s focus on one particular Lambda use case: replacing the bane of sysadmin existence, cron jobs. Specifically, cron jobs that affect your environment beyond “the server they run on.” You still have to worry about server log rotation; sorry.

Picture being able to take your existing cron jobs, and no longer having to care about the system they run on. Think about jobs like “send out daily emails,” “perform maintenance on the databases,” “trigger a planned outage so you can look like a hero to your company,” etc.

If your cron job is written in one of the supported Lambda languages, great– you’re almost there. For the rest of us, we probably have a mashup of bash scripts. Rejoice, for hope is not lost! Simply wrap your terrible shell script (I’m making assumptions here– all of my shell scripts are objectively terrible) inside of a python or javascript caller that shells out to invoke your script. Bundle the calling function and the shell script together, and you’re there. As a bonus, if you’re used to running this inside of a cron job, you likely have already solved for the myriad shell environment variable issues that bash scripts can run into when they’re called by a non-interactive environment.
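
As a rough illustration of that wrapper, here is a minimal Python sketch. The script name `nightly_job.sh` and the `run_script` helper are made up for this example. `LAMBDA_TASK_ROOT` is the environment variable the Lambda runtime sets to the directory holding your unpacked code. The timeout value is an assumption chosen to stay under the function's own limit.

```python
import os
import subprocess

# Hypothetical script name; bundle the script next to this file in the zip.
SCRIPT_NAME = "nightly_job.sh"


def run_script(path, interpreter="bash", timeout=240):
    """Run a shell script and return its exit code, stdout, and stderr."""
    result = subprocess.run(
        [interpreter, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,  # give up before Lambda kills the whole function
    )
    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def lambda_handler(event, context):
    # Fall back to the current directory when running outside Lambda.
    task_root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    outcome = run_script(os.path.join(task_root, SCRIPT_NAME))
    if outcome["returncode"] != 0:
        # Raising marks the invocation as failed instead of hiding the error.
        raise RuntimeError("script failed: " + outcome["stderr"])
    return outcome
```

Raising on a non-zero exit code is a design choice: it makes a failed script show up as a failed invocation, much as a failing cron job would send mail.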

Set your Lambda trigger to be a “CloudWatch Event - Scheduled” event, and you’re there. It accepts a close relative of the cron syntax we all used to hate but have come to love in a technical form of Stockholm Syndrome: six fields instead of five, with a year field on the end.

This is of course a quick-and-dirty primer for getting up and running with Lambda in the shortest time possible– but it gives you a taste of what the system is capable of. More importantly, it gives you the chance to put “AWS Lambda” on your resume– and your resume should always be your most important project.

If you have previous experience with AWS Lambda and you’re anything like me, your first innocent foray into the console for AWS Lambda was filled with sadness, regret, confusion, and disbelief. It’s hard to wrap your head around what it is, how it works, and why you should care. It’s worth taking a look at if you’ve not used it– this type of offering and the design patterns that go along with it are likely to be with us for a while. Even if you’ve already taken a dive into Lambda, it’s worth taking a fresh look at– the interface was recently replaced, and the capabilities of this platform continue to grow.

December 13, 2017

Day 13 - Half-Dead TCP Connections and Why Heartbeats Matter

By: Alejandro Brito Monedero (@ae_bm)

Edited By: J. Paul Reed (@jpaulreed)

We are living in interesting times in the tech world, full of trendy technologies, like cloud computing, containers, schedulers, and serverless.

They’re all rainbows and unicorns when they work. But we can start to forget about the systems that support our abstractions until they break, and we have to give our best to fix them.

Some time ago, in one of our multiple pub-sub systems, there were some processes publishing messages to a message broker. Those messages were consumed by a process running in a container. So far this isn’t too exotic; it’s easy to diagram out:

Pub Sub

For the most part, this system worked as expected. However, at one point, we received some alerts from the monitoring system. Those alerts reported that the broker had a lot of queued messages and no consumers connected. Our first reaction was to restart the consumer container and call it a day. But the error kept happening, always at the worst possible time.

While taking a closer look at the problem, we confirmed that the broker had no consumers to deliver the messages to. The surprise came when we inspected the consumer container. It was still running and seemingly all was well, except we did notice it was blocked on the socket used to communicate with the broker. When we inspected the socket statistics inside the container’s network namespace, they showed a connection to the broker:

# ss -tpno
ESTAB      0      0               <container ip>:<some port>         <broker ip>:<broker port>

Upon seeing this, our reaction could pretty much be summed up as:

The problem seemed to be that the connection state between the broker and consumer was not synchronized. In the host network namespace (where the broker was running), there wasn’t any TCP connection from the container. In the container network namespace, however, there was an established connection to the broker. Our dear RFC 793 describes this situation in its section on Half-Open Connections and Other Anomalies (emphases mine):

An established connection is said to be “half-open” if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, or if the two ends of the connection have become desynchronized owing to a crash that resulted in loss of memory. Such connections will automatically become reset if an attempt is made to send data in either direction. However, half-open connections are expected to be unusual, and the recovery procedure is mildly involved.

If at site A the connection no longer exists, then an attempt by the user at site B to send any data on it will result in the site B TCP receiving a reset control message. Such a message indicates to the site B TCP that something is wrong, and it is expected to abort the connection.

After that nice enlightenment, we started to think of possible causes for that “desynchronization.” Options that came to mind included:

  • the broker restarting or crashing
  • man-in-the-middle attack (MITM)
  • A grumpy kernel (iptables, ebtables, bridges, etc)
  • A grumpy container engine
  • Some Lovecraftian horror show

To determine which it was, we first checked whether the broker had crashed or been restarted. Upon inspection, we found it’d been running for a long time and other systems using it were working normally. So it wasn’t a problem with the broker.

The kernel iptables and bridge didn’t show anything weird. A MITM attack seemed a bit exotic. The other options were hard to prove, and we thought it wouldn’t be very professional of us to blame the container system without any evidence. ;-)

While trying to think of other possible causes, we kept tcpdump running on one of the consumer containers. tcpdump captured an RST message sent from the container IP to the broker, in response to the broker sending a data message to the consumer container after a long period of inactivity. The weird thing is that this network traffic never reached the container, nor did the RST originate from the container. Maybe the MITM attack wasn’t such an exotic possibility after all?!

Meanwhile, we tried to re-create the problem’s end state and work toward making our systems resilient to this situation. We used iptables to drop or reset traffic from the broker to the container after the container connected to the broker. Both methods allowed us to observe the same end state we were getting in production, confirming the container never learns that the broker connection is lost. Figuring out how to learn that your peer is down even when the TCP connection state is established proved difficult. But after some Internet searching, we found RFC 1122’s section on TCP Keep-Alives (again emphases mine):

Implementors MAY include “keep-alives” in their TCP implementations, although this practice is not universally accepted. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off.

Keep-alive packets MUST only be sent when no data or acknowledgement packets have been received for the connection within an interval. This interval MUST be configurable and MUST default to no less than two hours.

DISCUSSION:

A “keep-alive” mechanism periodically probes the other end of a connection when the connection is otherwise idle, even when there is no data to be sent. The TCP specification does not include a keep-alive mechanism because it could:
(1) cause perfectly good connections to break during transient Internet failures; (2) consume unnecessary bandwidth (“if no one is using the connection, who cares if it is still good?”); and (3) cost money for an Internet path that charges for packets.

A TCP keep-alive mechanism should only be invoked in server applications that might otherwise hang indefinitely and consume resources unnecessarily if a client crashes or aborts a connection during a network failure.

Translation: distributed systems are fun… and determining whether a connection is still valid is often the cherry on top.

But before trying to poke at the TCP stack, we kept investigating. We found out that AMQP supports heartbeats, and they are used to check if a connection is still valid. The library we were using had this option disabled by default, which explains why the container was blocked and waiting instead of trying to reconnect to the broker. To make things worse, because the container is a consumer, it never sends data to the broker. If the container had sent data on the same socket, it could have detected on its own whether the connection was still valid.

To fix this, we evaluated two solutions:

  • The TCP keep-alive fix was the fastest to implement, but we didn’t like it because it delegated detection of the broken connection to the kernel TCP implementation. Also we didn’t really want to mess with kernel socket options.
  • For an alternative, we ran some tests with other applications and they handled it at the application level (thank you Bandwagon effect). Through this testing, we found we could change the library to activate the AMQP heartbeats. It took more time, but it felt like a better solution to use the mechanisms provided by the AMQP protocol.
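
To make the application-level option more concrete, here is a minimal sketch of the bookkeeping behind a heartbeat check. It is not the AMQP library’s implementation. The `HeartbeatMonitor` class, its parameter names, and the “two missed intervals” threshold are assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Track when the peer was last heard from and decide if it looks dead."""

    def __init__(self, interval_seconds, max_missed=2, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self.max_missed = max_missed  # assumed threshold, tune to taste
        self._clock = clock
        self._last_seen = clock()

    def record_traffic(self):
        """Call whenever any frame (data or heartbeat) arrives from the peer."""
        self._last_seen = self._clock()

    def peer_is_dead(self):
        """True once more than max_missed heartbeat intervals pass in silence."""
        silence = self._clock() - self._last_seen
        return silence > self.interval_seconds * self.max_missed
```

A consumer loop would call `record_traffic()` on every received frame and check `peer_is_dead()` on a timer, closing the socket and reconnecting when it returns true. The injectable `clock` is only there to make the logic easy to test.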

But what about the MITM attack we thought we were seeing?

First some context: we run periodic, short-lived helper containers. We have observed with tcpdump that when a container starts, it announces some ICMPv6 memberships. Also by default, the container network namespace is attached to a network bridge. The network bridge uses a cache to associate addresses with their respective ports, like a network switch. It populates this table when it sees network traffic; as you might imagine, if there isn’t any traffic for some time, the data in the table becomes stale.

The MITM “attack” happens during a period without traffic between the broker and the container. The bridge cache goes stale, and when a short-lived container is launched, there is a chance for it to get the same IP address the consumer container has. This new container changes the bridge cache. If the broker then sends a message to the consumer, the bridge delivers it to the new container, which answers with a TCP RST because it doesn’t have a TCP connection with the broker. Finally, the broker receives the TCP RST and aborts its connection with the consumer. Such is the magic of giving the same IP address to different containers.

A picture is worth a thousand words.

Without the MITM

With the MITM

Ultimately, the problem turned out to be one of the most exotic possibilities we had come up with, making us feel pretty:

Conclusion

Our programs must be prepared to handle network disruptions even when the network traffic doesn’t leave a single host and you are using containers. Remember: the network is not reliable! If we forget this, we will have a lot of “fun” with distributed systems bugs.

Perhaps more importantly: always remember that if your program never sends traffic on its TCP socket, you can’t be sure whether the connection is valid or if you will end up waiting for a message that will never arrive.

There are two solutions to avoid this situation: the first is to use TCP keepalives and delegate detection of stale connections to the OS; the other is to implement or use a heartbeat mechanism at a higher layer in the protocol.
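
For the first of these, here is a minimal sketch of switching on keep-alives for a single socket. The `enable_keepalive` helper and its default values are assumptions made for this example. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are Linux names, so the sketch skips them on platforms where Python does not expose them.

```python
import socket


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10, max_probes=5):
    """Turn on TCP keep-alive probes for one socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; without it the system-wide defaults apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, max_probes)
    return sock
```

With the defaults in this sketch, an unreachable peer should be noticed after roughly 60 seconds of idleness plus 5 probes sent 10 seconds apart, far sooner than the two-hour default interval the RFC calls for.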

Both alternatives have their pros and cons, so you’ll need to find which one is best for your team and the distributed system you run. But now that you’ve seen a story where TCP half-open connections, “anomalies,” and keep-alives all worked together, you’ll know that a MITM “attack” might not be such an exotic cause of the problem, even if it’s not an attacker trying to get in, but rather your own kernel.

Extras

While preparing this post, I found this article, which discusses half open connections; it would have been handy when explaining this problem to my coworkers.

December 12, 2017

Day 12 - Monitoring Postgres Replication Lag

By: Kathryn Exline (@kathryn_ex)
Edited By: Baron Schwartz (@xaprb)

Have you created replicas of your PostgreSQL databases? I am going to assume you are a good database steward and answered that question with a resounding “YES INDEEDLY DO!” Ned Flanders style. If not, I recommend taking the time to be kind to your future self and do so as soon as possible. We won’t talk about how to do that here, but you can find details on how to configure replication in the PostgreSQL documentation, the PostgreSQL wiki, and around the internet.

With your trusty replicas in place, make sure you take the time to properly monitor your clusters. One of the most important metrics of replication health, albeit seductively easy to overvalue, is “replication lag”. Before I show you a few simple queries to collect this value on your PostgreSQL clusters, let us briefly talk about replication in PostgreSQL. If you are already familiar with the concept of replication and how it is implemented in PostgreSQL, feel free to skip ahead to the “Why Care About Lag?” section.

The Basics: Replication and WAL

Replication is a mechanism where data from one database (a “primary”) is copied to another secondary database (a “replica” or “standby”), keeping it in sync. Most databases have built-in mechanisms to support this feature. After you configure your primary PostgreSQL database for your service, you should create one or more replica databases in case you lose your primary database or decide you want to offload some operations from the primary. Generally, you initialize a replica with a snapshot of the primary and then it stays up to date by fetching and replaying the primary’s transactions.

PostgreSQL implements replication via the Write-Ahead-Log or the “WAL” (pronounced “wall”, like the big icy thing in Game of Thrones). The notion of the WAL is not unique to PostgreSQL, and is similar to that of journaling in file systems. It ensures transactions are logged durably before they are committed, so updates can be recovered and replayed in the case of a crash. Aside from crash recovery, PostgreSQL leverages the WAL for internal performance gains and built-in replication support.

The WAL is a collection of 16MB binary files located in the pg_xlog directory of your data directory (renamed pg_wal in PostgreSQL 10). Each time the database gets a transaction that requires changing any data, it appends a record of the transaction to the most recently created WAL segment file and assigns the record a Log Sequence Number (LSN) to note its position in the WAL. I explicitly say position and not time because, as the term Log Sequence Number suggests, the WAL files and their individual records are based on a sequence-based timeline. Why? Because if you are processing a high volume of transactions, timestamps may not be unique or granular enough to validate that your transactions are executed in the correct serial order. Not to mention time is full of nasty tricksies. This will be important later when we look at the queries you can use to find your primary’s and replicas’ position in the WAL.

PostgreSQL uses the WAL to make replicas of the primary in one of two ways. The latest and greatest is via streaming replication, where each WAL record is sent to the replica as quickly as possible to be replayed. By default, this is done asynchronously so the replica can process the record without delaying the commit on the primary; however, PostgreSQL also supports synchronous replication where a transaction on the primary must wait until the WAL record is committed on both the primary and the replica before considering the transaction successful.

The second and older option is via log-shipping where it ships one full WAL segment file (16MB) at a time from the primary to a replica. This generally results in higher replication lag since the replica will not receive the WAL records until the file is completely filled. Streaming replication is best for most use cases, but I recommend reading the PostgreSQL documentation around log-shipping standby servers for in-depth explanations of these two options.

Why Care About Lag?

Replication lag is the replica’s distance behind the primary in the sequential timeline. The time it takes to copy data from the primary to a replica, and apply the changes, can vary based on a number of factors including network time, replication configuration, and activity on both the primary and replicas. Unsurprisingly, I have seen replication lag spike on several occasions due to network issues. In another case, I saw replication lag spike on a replica that was not able to find and recover a WAL file from an archiving node, and it quietly fell out of date. The potential causes are widespread and I have found that replication lag is often an indicator that something is subtly failing or behaving unexpectedly.

Ultimately, it is safe to assume that there will be some amount of lag on any replica. But why do you need to know the replication lag in your clusters?

Disaster Recovery

In most scenarios where a primary database is lost, users want to promote the most up to date replica to ensure minimal data loss. You and your tooling can measure the lag to select the optimal replacement for the primary.

Service Strategies and Optimizations

If you connect all of your clients to the primary, you will eventually overload your database. When this happens, a common technique is to direct some read-only queries to replicas; however, if you don’t build your service to be aware of, and tolerate, replication lag, then your users will experience inconsistent behavior from your service. Knowing the typical replication lag of your replicas will help you strategize which services can still function in spite of potential lag.

Debugging and Observation

Just as measuring latency in an HTTP request can indicate an underlying issue, unusually high replication lag can indicate an issue with your databases. Unfortunately replication lag in isolation rarely informs users of the specific underlying problem, but it is a broad indicator of several issues and is another data point in your observability toolbelt.

How To Monitor Lag: Get Your “See Lags”

Now that we have a handle on the importance of monitoring your replication lag, let’s dive into two ways to measure replication lag.

By WAL Location

The most accurate way to determine the lag is to compare the current WAL location on the primary with the last WAL location received by the standby. To find the LSN value of the current WAL location in Postgres versions older than 10.x, run the following on the primary:

=# select pg_current_xlog_location();

In Postgres 10.x, you’ll need to use a newer function:

=# select pg_current_wal_lsn();

You should get an LSN value which looks like the following:

 pg_current_xlog_location 
--------------------------
 9C/1E306050
(1 row)

To find the LSN value of the last WAL location received and synced to disk by the standby, run the following on the replica. Once again, there’s a pre-10 syntax and a newer version for Postgres 10.x:

=# -- in Postgres 9.x
=# select pg_last_xlog_receive_location();

=# -- in Postgres 10.x
=# select pg_last_wal_receive_lsn();

You should get a similar record as the previous function:

 pg_last_xlog_receive_location 
-------------------------------
 95/75619450
(1 row)

Note that these functions denote what WAL position the replica has received from the primary, but not what it has applied to bring the replica’s copy into sync with the primary. There could be a difference between these two values. To find out what the replica has replayed, use the following functions on the replica:

=# -- in Postgres 9.x
=# select pg_last_xlog_replay_location();

=# -- in Postgres 10.x
=# select pg_last_wal_replay_lsn();

You can determine whether the replica is at the same point in the WAL as the primary by comparing the values of what’s been committed on the primary and what’s been received or replayed on the replica. The disadvantage of using WAL position is that, despite being an accurate representation of lag, it is difficult for humans to understand what an LSN difference really means. I have seen clever scripts that convert LSNs to the byte position in the WAL and take the difference of these values, but there is an easier option that leverages another built-in function to approximate time lag.
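
For the curious, the byte-position conversion is short. An LSN such as 9C/1E306050 is two hexadecimal numbers: the high 32 bits and the low 32 bits of a 64-bit position. Here is a minimal sketch, with function names invented for this example. PostgreSQL can also do the subtraction for you with pg_xlog_location_diff() in 9.x or pg_wal_lsn_diff() in 10.x.

```python
def lsn_to_bytes(lsn):
    """Convert an LSN string such as '9C/1E306050' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL the replica still has to receive or replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


# The two sample values shown earlier in this section.
print(lag_bytes("9C/1E306050", "95/75619450"))
```

The same conversion is handy when choosing a replica to promote: `max(replica_lsns, key=lsn_to_bytes)` picks the one furthest along in the WAL.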

By Time Difference

I told you earlier that time was tricky and the WAL is based on a sequence, but timestamps are more readable to humans and easier for external tools to ingest than WAL location values. PostgreSQL tracks the timestamp of the last transaction replayed from the WAL, allowing you to compare it with the current time using the following query on your replica:

=# -- Same in both Postgres 9.x and 10.x 
=# select now() - pg_last_xact_replay_timestamp();

This value needs to be read with additional context and taken with a grain of salt. For example, if the primary is idle and has no new transactions to ship, the value keeps growing even though the replica is fully caught up. It is meant to be an approximation of lag and should be treated as such. I find this query most useful to inject into my time-series observability metrics or to toss in a terminal pane when running operations that might affect lag. If you are selecting a replica to replace a failed primary, you should use the LSN instead of the approximate timestamp of the LSN.

Other Helpful Queries and Tools

You can run the following to determine whether you are interacting with the primary or a replica. A replica will return ‘t’ and the primary will return ‘f’:

=# select pg_is_in_recovery();

You can also translate an LSN value to the name of the corresponding WAL file within your pg_xlog directory (renamed pg_wal in Postgres 10). These functions cannot be executed during recovery, so run them on the primary:

=# -- In Postgres 9.x
=# select pg_xlogfile_name(pg_current_xlog_location());

=# -- In Postgres 10.x
=# select pg_walfile_name(pg_current_wal_lsn());
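
The mapping from an LSN to a file name is plain arithmetic, so it can also be worked out without a database connection. The Python below is a rough sketch of that mapping, not the server's own code: wal_file_name is a hypothetical helper, and it assumes the default 16 MB WAL segment size and timeline ID 1, both of which can differ on your cluster (the timeline changes after a promotion, for instance).

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def wal_file_name(lsn: str, timeline: int = 1,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Approximate the name of the WAL file that holds the given LSN."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) + int(low, 16)
    # An LSN exactly on a segment boundary is attributed to the preceding
    # file, matching the documented behavior of the built-in functions.
    segment_number = max(position - 1, 0) // segment_size
    segments_per_id = 0x100000000 // segment_size
    # File name = timeline, then the segment number split into two fields,
    # each printed as 8 uppercase hexadecimal digits.
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )


if __name__ == "__main__":
    print(wal_file_name("95/75619450"))  # 000000010000009500000075
```

Under those assumptions, the sample replica LSN 95/75619450 lands in the file 000000010000009500000075. Treat the result as a cross-check against what the server reports, not as a replacement for it.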

If you are curious about what a WAL file actually looks like, PostgreSQL introduced the pg_xlogdump tool in version 9.3 to convert the contents of the binary WAL file into human-readable form. Note that this tool was renamed to pg_waldump in version 10 and is intended mainly for debugging and educational purposes.

Beyond the Queries

If your databases run on a cloud platform, your provider may already provide these metrics for you. For example, AWS CloudWatch provides the ReplicaLag metric and GCP provides the replication metric. Finally, whether you use external tooling to monitor your replication lag or write your own monitoring plugins, you need to consider how you actually use these metrics.

As we discussed earlier, replication lag is a helpful metric and provides additional data points when making decisions about your services, but think long and carefully before alerting or paging around replication lag. You probably don’t want to. Replication lag is susceptible to a variety of factors, some of which are not actionable or inherently wrong, and it varies enough that you could find yourself bogged down in tuning alerting thresholds or developing complex anomaly detection. If you do choose to page on this value, make sure you embed plenty of headroom in your thresholds, provide context around potential lag causes in your alerting tools, and give your on-call rotation a few extra high-fives.

ADDITIONAL READING