Recently one of the nodes in a Ganeti cluster went down because of a faulty hard drive. Normally the machines in our Ganeti clusters have RAID, but this particular machine didn’t. Having a machine go offline like that would usually be a big deal, but with Ganeti and DRBD it usually isn’t.

After I triaged the situation and concluded that the HDD on node3 was a lost cause, I checked how Ganeti saw the situation. Below is what I found:

# gnt-cluster verify
* Verifying global settings
* Gathering data (3 nodes)
* Verifying node status
  - ERROR: node node1.osuosl.bak: ssh communication with node 'node3.osuosl.bak': ssh problem: exited with exit code 255 (no output)
  - ERROR: node node1.osuosl.bak: tcp communication with node 'node3.osuosl.bak': failure using the primary and secondary interface(s)
  - ERROR: node node2.osuosl.bak: ssh communication with node 'node3.osuosl.bak': ssh problem: exited with exit code 255 (no output)
  - ERROR: node node2.osuosl.bak: tcp communication with node 'node3.osuosl.bak': failure using the primary and secondary interface(s)
  - ERROR: node node3.osuosl.bak: while contacting node: Error 7: Failed connect to 10.1.0.179:1811; Success
* Verifying instance status
  - ERROR: node node3.osuosl.bak: instance vm1.osuosl.org, connection to secondary node failed
  - ERROR: node node3.osuosl.bak: instance vm2.osuosl.org, connection to secondary node failed
  - ERROR: node node3.osuosl.bak: instance vm3.osuosl.org, connection to secondary node failed
  - ERROR: instance vm4.osuosl.org: instance not running on its primary node node3.osuosl.bak
  - ERROR: node node3.osuosl.bak: instance vm4.osuosl.org, connection to primary node failed
  - ERROR: instance vm5.osuosl.org: instance not running on its primary node node3.osuosl.bak
  - ERROR: node node3.osuosl.bak: instance vm5.osuosl.org, connection to primary node failed
* Verifying orphan volumes
* Verifying orphan instances
* Verifying N+1 Memory redundancy
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node1.osuosl.bak fail
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node2.osuosl.bak fail
* Other Notes
 - WARNING: Communication failure to node node3.osuosl.bak: Error 7: Failed connect to 10.1.0.179:1811; Success
* Hooks Results
  - ERROR: node node3.osuosl.bak: Communication failure in hooks execution: Error 7: Failed connect to 10.1.0.179:1811; Success

That’s a lot of information to just say one of the nodes is offline. To summarize, this is what Ganeti is saying:

  • node1 & node2 can’t talk to node3
  • node3 isn’t responding to the master node
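  • vm1, vm2 & vm3 lost the connection to their secondary DRBD node (node3)
  • vm4 & vm5 are not running (node3 is their primary node)
  • node3 doesn’t have enough memory to deal with failovers (probably because Ganeti can’t see its resources)
  • communication with node3 also fails in the other checks and in hooks execution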
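
To get that kind of summary without reading every line, the ERROR lines can be grouped by the node or instance they name. What follows is a minimal sketch, not something Ganeti ships: the regular expression and the `group_errors` helper are my own assumptions about the line format, based only on the output above.

```python
import re
from collections import defaultdict

# A few lines copied from the `gnt-cluster verify` output above.
SAMPLE = """\
  - ERROR: node node1.osuosl.bak: ssh communication with node 'node3.osuosl.bak': ssh problem: exited with exit code 255 (no output)
  - ERROR: node node3.osuosl.bak: while contacting node: Error 7: Failed connect to 10.1.0.179:1811; Success
  - ERROR: node node3.osuosl.bak: instance vm1.osuosl.org, connection to secondary node failed
  - ERROR: instance vm4.osuosl.org: instance not running on its primary node node3.osuosl.bak
  - ERROR: node node3.osuosl.bak: instance vm4.osuosl.org, connection to primary node failed
"""

# Assumed line shape: "- ERROR: node <name>: <message>" or
# "- ERROR: instance <name>: <message>".
ERROR_RE = re.compile(r"^\s*- ERROR: (node|instance) (\S+?): (.*)$")


def group_errors(text):
    """Group ERROR messages by (kind, name), e.g. ("node", "node3.osuosl.bak")."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        match = ERROR_RE.match(line)
        if match:
            kind, name, message = match.groups()
            grouped[(kind, name)].append(message)
    return dict(grouped)


if __name__ == "__main__":
    for (kind, name), messages in group_errors(SAMPLE).items():
        print(f"{kind} {name}: {len(messages)} error(s)")
```

In practice you would feed it the captured output of `gnt-cluster verify` instead of the sample string.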

Needless to say, node3 is down. Now let’s mark node3 offline and see what Ganeti shows.

# gnt-node modify -O yes node3
 - WARNING: Communication failure to node node3.osuosl.bak: Error 7: Failed connect to 10.1.0.179:1811; Success
# gnt-cluster verify
* Verifying node status
* Verifying instance status
  - ERROR: instance vm1.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm2.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm3.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm4.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm5.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
* Verifying orphan volumes
* Verifying orphan instances
* Verifying N+1 Memory redundancy
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node1.osuosl.bak fail
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node2.osuosl.bak fail
* Other Notes
  - NOTICE: 1 offline node(s) found.
* Hooks Results

That’s much easier to read and handle. At this point I’m ready to fail over the instances that are offline.

# gnt-instance failover --ignore-consistency vm4
* checking disk consistency between source and target
* shutting down instance on source node
 - WARNING: Could not shutdown instance vm4.osuosl.org on node node3.osuosl.bak. Proceeding anyway. Please make sure node node3.osuosl.bak is down. Error details: Node is marked offline
* deactivating the instance's disks on source node
 - WARNING: Could not shutdown block device disk/0 on node node3.osuosl.bak: Node is marked offline
* activating the instance's disks on target node
 - WARNING: Could not prepare block device disk/0 on node node3.osuosl.bak (is_primary=False, pass=1): Node is marked offline
* starting the instance on the target node
# gnt-instance failover --ignore-consistency vm5

Now let’s fix the secondary storage for the other instances.

# gnt-instance replace-disks -n node2 vm1
 - INFO: Old secondary node3.osuosl.bak is offline, automatically enabling early-release mode
Replacing disk(s) 0 for vm1.osuosl.org
STEP 1/6 Check device existence
 - INFO: Checking disk/0 on node1.osuosl.bak
 - INFO: Checking volume groups
STEP 2/6 Check peer consistency
 - INFO: Checking disk/0 consistency on node node1.osuosl.bak
STEP 3/6 Allocate new storage
 - INFO: Adding new local storage on node2.osuosl.bak for disk/0
STEP 4/6 Changing drbd configuration
 - INFO: activating a new drbd on node2.osuosl.bak for disk/0
 - INFO: Shutting down drbd for disk/0 on old node
 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline
      Hint: Please cleanup this device manually as soon as possible
 - INFO: Detaching primary drbds from the network (=> standalone)
 - INFO: Updating instance configuration
 - INFO: Attaching primary drbds to new secondary (standalone => connected)
STEP 5/6 Removing old storage
 - INFO: Remove logical volumes for 0
 - WARNING: Can't remove old LV: Node is marked offline
      Hint: remove unused LVs manually
 - WARNING: Can't remove old LV: Node is marked offline
      Hint: remove unused LVs manually
STEP 6/6 Sync devices
 - INFO: Waiting for instance vm1.osuosl.org to sync disks.
 - INFO: - device disk/0:  0.00% done, no time estimate
 - INFO: - device disk/0: 25.00% done, 2h 23m 24s remaining (estimated)
 - INFO: - device disk/0: 50.40% done, 47m 38s remaining (estimated)
 - INFO: - device disk/0: 76.40% done, 26m 46s remaining (estimated)
 - INFO: - device disk/0: 92.20% done, 7m 49s remaining (estimated)
 - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
 - INFO: Instance vm1.osuosl.org's disks are in sync.

By using --submit you can send the job to the background instead of waiting for its output. You can still view the output in real time by running gnt-job watch <job id>. I went ahead and told Ganeti to replace the secondary disks on the other two machines at the same time. Be careful about running too many replace-disks operations at once, as you may run into disk I/O issues on the nodes.
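
One way to keep the number of parallel syncs under control is to generate the commands up front and run them in small batches. The sketch below only builds and prints the command lines; it does not run anything. The instance-to-node mapping is hypothetical (the new secondary has to be a node other than the instance’s primary), and the batch size is an assumption you would tune for your own disks.

```python
def replace_disks_commands(new_secondaries):
    """Build one argv list per instance from an {instance: new secondary} mapping."""
    return [
        ["gnt-instance", "replace-disks", "--submit", "-n", node, instance]
        for instance, node in new_secondaries.items()
    ]


def in_batches(commands, size):
    """Split the commands into groups of at most `size` to limit parallel syncs."""
    return [commands[i:i + size] for i in range(0, len(commands), size)]


if __name__ == "__main__":
    # Hypothetical targets; pick a node that is not the instance's primary.
    plan = replace_disks_commands({"vm2": "node1", "vm3": "node2"})
    for batch in in_batches(plan, 1):
        for argv in batch:
            print(" ".join(argv))
        print("# wait for these jobs (gnt-job watch <job id>) before the next batch")
```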

There is another way I could have fixed this that would have required fewer steps: gnt-node evacuate. This command lets you move all the secondary storage from a single node to another node quickly instead of doing it VM by VM. The command probably would have looked something like this:

# gnt-node evacuate --force -n node2 node3 

Instead of specifying which node to migrate storage to, you can also use an IAllocator plugin to automatically pick which node to use. So the command above would have been:

# gnt-node evacuate --force -I hail node3 

After a few minutes I had brought redundancy back into my cluster and the instances back online, with no data loss.

Ganeti rocks!

After nearly a month and a half (42 days) of development since 0.4 was released, the OSUOSL has released Ganeti Web Manager 0.5 today. This second release includes some very nice new features.

Read the full ChangeLog for more details.

noVNC Console

My favorite new feature by far is the inclusion of noVNC by default for VNC console access. This removes the Java requirement for your browser and makes the console much easier to use. It works best in Chrome/Chromium, but you can also use Firefox.


New Overview Page

I’m also excited about the new overview pages for users and admins. It makes it much easier to see the usage of your cluster(s) quickly. For users it will show some basic resource/quota usage.


Upgrading

If you’re upgrading from 0.4 be sure to read the upgrading wiki page and go over the installation page again. We’ve added a few new requirements such as South for database migrations and Twisted for the new VNC Auth Proxy.

Be sure to check out Peter’s blog post about the 0.5 release as well!

Lead OSUOSL Developer Peter Krenesky has written an excellent blog post going over how the permission system works in Ganeti Web Manager. A key feature I’m looking forward to using more at the OSUOSL is managing our clusters with the following scenarios:

  • Fully managed - users have no access at all.  Only admins can create, reboot, or modify.
  • Partially managed - users can’t create virtual machines, but they have some limited ability to manage them.
  • Self Service - users can create virtual machines on demand.  They can create and manage their own virtual machines as needed.
  • User Managed Cluster - a user has control of an entire cluster.

The permission system in GWM will give Ganeti cluster admins the ability to manage each cluster and virtual machine in finer detail. Ganeti by itself doesn’t come with any sort of user access management system, nor should it really. It makes sense to build tools like GWM on top of Ganeti to deal with such situations. I hope to see more features and bug fixes related to the permissions and quota system.

I’d love to see some feedback on how we implemented the system and how we can improve it!


After three months of development Ganeti Web Manager 0.4 has been released! This project has been developed primarily by the OSU Open Source Lab with help from the folks at GRNET and several Google GCI students. Ganeti Web Manager (GWM) is a Django-based web application that connects to the Ganeti Remote API (RAPI). It gives Ganeti administrators access to the most common tasks and incorporates a permission system. GWM has a long way to go in terms of implementing more of the RAPI features and UI improvements, but this first release should be enough to get people to start using it in production. You can download Ganeti Web Manager here.
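
For readers who haven’t used the RAPI directly: it is an HTTP interface, served over HTTPS on port 5080 by default, with resources such as /2/instances. The snippet below is a minimal sketch of talking to it with only the Python standard library. It is not how GWM itself is implemented, the helper names are my own, and the host name you pass in is up to you; clusters that require authentication or use a self-signed certificate will need extra handling.

```python
import json
import urllib.request

RAPI_PORT = 5080  # default port of the Ganeti RAPI daemon


def instances_url(cluster_host, bulk=False):
    """Build the RAPI URL for the instance list; bulk=True asks for full details."""
    url = f"https://{cluster_host}:{RAPI_PORT}/2/instances"
    return url + "?bulk=1" if bulk else url


def fetch_instances(cluster_host, context=None):
    """GET the instance list from a reachable cluster.

    Pass an ssl.SSLContext as `context` if the cluster's certificate is not
    trusted by your system (an assumption about your setup, not a default).
    """
    with urllib.request.urlopen(instances_url(cluster_host), context=context) as resp:
        return json.load(resp)
```

Calling `fetch_instances` with your cluster’s host name returns the decoded JSON list of instances.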

Features in 0.4:

  • Caching system
  • Permissions system:
    • User & Group management
    • Per cluster/virtual machine permissions
  • Basic VM management: Create, Delete, Start, Stop, Reboot, VNC Console
  • SSH key feed (for a ganeti post-install hook)
  • Basic quota system
  • Import tools

Basic Installation Requirements

GWM has a fairly small footprint and only requires a minimal set of Django dependencies.

Currently Firefox and Chrome should work well, but be aware that IE will have issues. I certainly hope whoever is using this application has at least Firefox installed. You will need the Java browser plugin in order to use the VNC console. The VNC console requires direct access to the VNC port on the VM, but we are working with GRNET to add a VNC Auth Proxy to get around that.

Ganeti compatibility:

  • >= 2.2.x – supported
  • 2.1.x – mostly supported
  • 2.0.x – unsupported but may work
  • 1.x – unsupported
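
The compatibility list above boils down to a small lookup. Here is a sketch of it as a function; `support_level` is a hypothetical helper of mine (not part of GWM), and it assumes version strings of the form "major.minor" or "major.minor.patch".

```python
def support_level(version):
    """Map a Ganeti version string such as "2.1.5" to the support level listed above."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (2, 2):
        return "supported"
    if (major, minor) == (2, 1):
        return "mostly supported"
    if major == 2:
        return "unsupported but may work"
    return "unsupported"


if __name__ == "__main__":
    for version in ("2.2.1", "2.1.7", "2.0.3", "1.2.9"):
        print(version, "->", support_level(version))
```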

Screenshots:

  • List of all virtual machines on a cluster
  • Form for creating a new virtual machine
  • Virtual machine creation output, dynamically updating
  • Virtual machine VNC console using the Java client

Upcoming Features

We have lots of features we would like to eventually implement in GWM. You can see many of them on our issue tracker but here’s a summary of notable features we plan to do.

I’m excited to see where Ganeti Web Manager goes. I plan to start rolling it out at the OSUOSL very soon and giving access to some of the projects we host. If you would like to become a contributor to the project, please check us out on IRC in #ganeti-webmgr on Freenode.

Check my blog and Peter’s blog for more updates soon on Ganeti Web Manager.

One of the many large projects I’m working on at the OSUOSL has been migrating all of our virtualization over to Ganeti and KVM. Needless to say it’s kept me from updating my blog, but I hope to make up for it. I thought I would give a rundown of how we use Ganeti at the OSUOSL and where we plan to go from there.

So far we have 10 clusters ranging in size from single-node up to four-node clusters. Each node is running Gentoo and managed with our cfengine setup. There are approximately 120 virtual machines deployed across all the clusters, with the majority (~70) in our production cluster of four nodes. Each node in the production cluster is running between 17 and 18 KVM instances.
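
The 17 to 18 figure is just the roughly 70 production VMs spread evenly over four nodes. A quick check of that arithmetic, using the approximate numbers from the paragraph above (`even_spread` is only an illustrative helper):

```python
def even_spread(vms, nodes):
    """Return per-node VM counts when `vms` are spread as evenly as possible."""
    per_node, extra = divmod(vms, nodes)
    # `extra` nodes carry one VM more than the rest.
    return [per_node + 1] * extra + [per_node] * (nodes - extra)


if __name__ == "__main__":
    print(even_spread(70, 4))
```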

Project Ganeti Clusters

Several hosted projects including OSGeo, phpBB, and ECF have their own clusters, which we fully manage at the node level. It works well for them: they don’t have to worry about maintaining the virtualization cluster, and they still get the flexibility of deploying dedicated VMs on their own hardware. I’ve been recommending this direction for current projects and for new projects we get requests for. So far it seems to be working well for both the OSUOSL and the projects we host.

Image Deployment

For deployment we use ganeti-instance-image which is something I wrote to help make deployments faster and more flexible. It uses various types of images (tarball, filesystem dump, qemu-img) to unpack a pre-made system and deploy it with networking, grub, and serial fully functional. Creating the images is currently a manual process but I have it semi-automated using kickstart and preseed config files for building systems quickly and predictably. The amazing part is deploying a fully functional VM in under one minute using ganeti-instance-image.

Web-based Management

An upcoming tool that the OSUOSL is working on is a web-based frontend for managing Ganeti clusters called Ganeti Web Manager. It’s written using the Django framework and connects to Ganeti via its RAPI protocol. Our lead developer Peter Krenesky and many of our students have been hard at work on this project for the last month and a half.

Some of the goals of this project include:

  • Permission system for users and how they access the cluster(s)
  • Easy VM deployment and management
  • Console access
  • Empower VM users

We’re very close to making our first release of ganeti-webmgr which should include a basic set of features. We still have a lot to work on and I look forward to seeing how it evolves.