Tuesday, 15 September 2015

Taking a Nova Compute host down for maintenance

There are times when you want to take an OpenStack Nova compute host down, for example to upgrade the host's OS or hardware. Of course, in a perfectly cloudy world you'd just pull the plug and your massively distributed app would automatically reconfigure itself around the missing bits. In practice, though, you might want something a little less disruptive.

I wrote nova-compute-maintenance to do a best-effort evacuation of a nova compute host prior to taking it down. This tool first disables the target nova compute service in the scheduler, which prevents any new instances from being scheduled to it. Then it attempts to live migrate all instances on the host to other hosts, leaving the choice of destination to the scheduler.
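The two steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code. It assumes `nova` is an already-authenticated client object from an Icehouse-era python-novaclient, where `services.disable(host, binary)` and `servers.live_migrate(server, host, block_migration, disk_over_commit)` take these arguments (later API microversions changed both signatures). The function name `start_evacuation` is hypothetical, and unlike the real tool this sketch requests every migration at once.

```python
def start_evacuation(nova, host):
    """Disable nova-compute on `host`, then request a live migration for
    every instance found there. Returns the list of instances found."""
    # Stop the scheduler from placing new instances on this host.
    nova.services.disable(host, 'nova-compute')

    # List instances on the host across all tenants (needs admin credentials).
    servers = nova.servers.list(search_opts={'host': host, 'all_tenants': 1})

    for server in servers:
        # host=None leaves the choice of destination to the scheduler.
        # block_migration=False assumes shared instance storage (an assumption
        # made for this sketch, not something the tool is known to require).
        nova.servers.live_migrate(server, None, False, False)
    return servers
```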

Usage is simple. The tool uses the Python nova client library, and takes its authentication credentials from the environment the same way the nova command line client does:

$ source keystonerc_admin

For simplicity, the credentials can only be specified via the environment, not via the command line. The tool takes a small number of command line options:

usage: nova-compute-maintenance.py [-h]
         [--max-migrations MAX_MIGRATIONS]
         [--poll-interval POLL_INTERVAL]
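As a rough illustration of how a script like this might pick up its credentials and options, here is a minimal sketch. It is not the tool's source. The `OS_*` names are the variables a keystonerc file conventionally exports. The `host` positional and `--cold-fallback` flag are taken from elsewhere in this post rather than from the usage text above, the default of 2 for `--max-migrations` is the one described below, and the default poll interval is purely an assumption. The function names are hypothetical.

```python
import argparse
import os


def credentials_from_env(environ=os.environ):
    """Collect the variables a keystonerc file typically exports."""
    names = ('OS_USERNAME', 'OS_PASSWORD', 'OS_TENANT_NAME', 'OS_AUTH_URL')
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise SystemExit('Missing environment variables: ' + ', '.join(missing))
    return {n: environ[n] for n in names}


def build_parser():
    """Build a parser for the options mentioned in this post."""
    parser = argparse.ArgumentParser(prog='nova-compute-maintenance.py')
    parser.add_argument('host', help='compute host to evacuate')
    parser.add_argument('--max-migrations', type=int, default=2,
                        help='maximum concurrent migrations')
    # The default of 30 seconds is an assumption made for this sketch.
    parser.add_argument('--poll-interval', type=int, default=30,
                        help='seconds between status polls')
    parser.add_argument('--cold-fallback', action='store_true',
                        help='fall back to cold migration if live fails')
    return parser
```

Calling `build_parser().parse_args(['compute-host-1.example.com'])` would then yield `max_migrations == 2` and `cold_fallback == False`.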

At its simplest, the invocation is just:

$ ./nova-compute-maintenance.py compute-host-1.example.com

The tool is quite chatty. It will initially display a list of all instances it found:

Found instances on host:
  foo-7ac59434-1e45-47a3-bc84-8e39dd9562e8(7ac59434- ...
  foo-16d59bc9-c072-4342-bd85-18fa3b8aa47a(16d59bc9- ...
  foo-4a8d3e3d-de86-4bf3-83f6-9d57ba76a7af(4a8d3e3d- ...
  foo-8a729a41-b59f-470d-ba69-dfa5adae6384(8a729a41- ...
  foo-d8dccfe4-2ec7-4e5e-98d0-de453ef2cae8(d8dccfe4- ...
  foo-325f5d60-ea64-4141-9c20-d74cb7796578(325f5d60- ...
  foo-3e8ce1b7-3ffb-4cbd-92b8-0139ae726f1a(3e8ce1b7- ...
  foo-6293c837-f40d-40b9-b90c-66dc34acc114(6293c837- ...

It will initiate up to a fixed number of migrations at any one time. By default this is 2, but it can be adjusted to suit the capabilities of your system with the --max-migrations argument. It polls nova regularly to monitor the status of these migrations, and starts new ones as required. It displays its current status every time it polls:

  foo-7ac59434-1e45-47a3-bc84-8e39dd9562e8(7ac59434- ...
  foo-16d59bc9-c072-4342-bd85-18fa3b8aa47a(16d59bc9- ...
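The bounded polling loop described above might look something like the following sketch. It is an illustration under stated assumptions rather than the tool's implementation: `start` and `is_done` are hypothetical callbacks standing in for the nova API calls, the 30 second default interval is an assumption, and there is no failure handling or status output here.

```python
import time


def drain(pending, start, is_done, max_migrations=2, poll_interval=30,
          sleep=time.sleep):
    """Keep at most `max_migrations` migrations in flight until none remain.

    `start(instance)` begins a migration; `is_done(instance)` reports
    whether it has finished. Both are hypothetical stand-ins.
    """
    pending = list(pending)
    active = []
    while pending or active:
        # Drop migrations that have finished since the last poll.
        active = [i for i in active if not is_done(i)]
        # Top up to the concurrency limit.
        while pending and len(active) < max_migrations:
            instance = pending.pop(0)
            start(instance)
            active.append(instance)
        if active:
            # A real tool would report its current status here.
            sleep(poll_interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without waiting for real time to pass.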

On completion it displays success or failure. In this case the evacuation failed: one instance is left on the host, and it is in the ACTIVE state.

Failed to migrate the following instances:
  foo-6293c837-f40d-40b9-b90c-66dc34acc114(6293c837- ...: ACTIVE
See logs for details

The tool is idempotent, so if it fails it's completely safe to run it again:

Found instances on host:
  foo-6293c837-f40d-40b9-b90c-66dc34acc114(6293c837- ...
  foo-6293c837-f40d-40b9-b90c-66dc34acc114(6293c837- ...
Success: No instances left on host

You can test the success or failure of the script by its exit code, which follows the usual convention: zero for success, non-zero for failure.
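If you are driving the tool from a larger maintenance script, the exit code check could be wrapped along these lines. This is only a sketch; the helper name `succeeded` is hypothetical.

```python
import subprocess


def succeeded(command):
    """Run `command` (a list of arguments) and return True only if it
    exited with status 0."""
    return subprocess.call(command) == 0


# Example use (hostname as in the invocation shown earlier):
#   if not succeeded(['./nova-compute-maintenance.py',
#                     'compute-host-1.example.com']):
#       ...  # do not power off the host; inspect the remaining instances
```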

The tool is conservative by default: it will not do anything disruptive to an instance. This means that there are certain instances which it cannot handle automatically. These include instances which are paused, being rescued, or in the error state. If the host has any instances in these states, the tool will migrate all other instances, but leave these in place. As above, the tool will report failure and list all remaining instances and their states.
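One way to picture this filtering is the sketch below. It is not the tool's code. `PAUSED`, `RESCUE` and `ERROR` are status strings the nova API reports, but the exact set the tool skips is an assumption here (the list above is not exhaustive), and `partition` is a hypothetical name. Each server is assumed to expose a `status` attribute, as novaclient server objects do.

```python
# Assumption: statuses the tool leaves alone, based on the description above.
UNSAFE_STATUSES = {'PAUSED', 'RESCUE', 'ERROR'}


def partition(servers):
    """Split servers into (migratable, skipped) according to their status."""
    migratable, skipped = [], []
    for server in servers:
        if server.status in UNSAFE_STATUSES:
            skipped.append(server)
        else:
            migratable.append(server)
    return migratable, skipped
```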

There is one case where the tool will disrupt an instance. If you specify --cold-fallback on the command line and it fails to live migrate an instance 3 times, it will fall back to trying a cold migration. This will cause the instance to be shut down for the duration of the migration. By default, if live migration fails the tool will leave the instance alone and report it as a failure.
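The retry-then-fallback behaviour can be sketched as follows. This is a simplified illustration, not the tool's implementation: `live_migrate` and `cold_migrate` are hypothetical callbacks that return True on success, and the real tool interleaves retries with its polling rather than retrying in a tight loop.

```python
def migrate_with_fallback(instance, live_migrate, cold_migrate,
                          cold_fallback=False, max_attempts=3):
    """Try live migration up to `max_attempts` times, then optionally
    fall back to a (disruptive) cold migration.

    Returns 'live', 'cold' or 'failed'.
    """
    for _ in range(max_attempts):
        if live_migrate(instance):
            return 'live'
    if cold_fallback:
        # Cold migration shuts the instance down while it moves.
        return 'cold' if cold_migrate(instance) else 'failed'
    # Default: leave the instance alone and report failure.
    return 'failed'
```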

I have developed this tool against Red Hat Enterprise Linux OpenStack Platform 5, which is based on OpenStack Icehouse. I would expect it to work against subsequent versions, too.

N.B. This tool's functionality overlaps significantly with the host-evacuate-live command of the Nova client, although it is considerably more robust. It is my intention to roll the functionality of this tool into Nova itself or, failing that, into a more robust command in the Nova client. This external tool is intended to bridge the gap until that lands.