I’m going to be speaking at SCALE 9x this year and giving a session on Scalable Virtualization with Ganeti on Saturday, February 26th at 6pm. I will be going over the basics of what Ganeti is and how you use it. This session will be very similar to the ones I gave last year at Open Source Bridge and LinuxCon Boston.

If you want to meet me in person and talk about what’s going on at the Open Source Lab, Supercell, Ganeti, Gentoo, or just other random stuff, feel free to! I’ll be the only person coming from the OSUOSL, but I’ll be sure to represent us the best that I can.

See you at SCALE 9x in a few weeks!

Recently I had one of the nodes in a Ganeti cluster go down because of a faulty hard drive. Normally we would have RAID on machines in our Ganeti clusters, but this particular machine didn’t. Having a machine go offline like that would usually be a big deal, but with Ganeti and DRBD it usually isn’t.

After I triaged the situation and decided that the HDD on node3 was a lost cause, I checked what Ganeti reported about the cluster. Below is what I found:

# gnt-cluster verify
* Verifying global settings
* Gathering data (3 nodes)
* Verifying node status
  - ERROR: node node1.osuosl.bak: ssh communication with node 'node3.osuosl.bak': ssh problem: exited with exit code 255 (no output)
  - ERROR: node node1.osuosl.bak: tcp communication with node 'node3.osuosl.bak': failure using the primary and secondary interface(s)
  - ERROR: node node2.osuosl.bak: ssh communication with node 'node3.osuosl.bak': ssh problem: exited with exit code 255 (no output)
  - ERROR: node node2.osuosl.bak: tcp communication with node 'node3.osuosl.bak': failure using the primary and secondary interface(s)
  - ERROR: node node3.osuosl.bak: while contacting node: Error 7: Failed connect to 10.1.0.179:1811; Success
* Verifying instance status
  - ERROR: node node3.osuosl.bak: instance vm1.osuosl.org, connection to secondary node failed
  - ERROR: node node3.osuosl.bak: instance vm2.osuosl.org, connection to secondary node failed
  - ERROR: node node3.osuosl.bak: instance vm3.osuosl.org, connection to secondary node failed
  - ERROR: instance vm4.osuosl.org: instance not running on its primary node node3.osuosl.bak
  - ERROR: node node3.osuosl.bak: instance vm4.osuosl.org, connection to primary node failed
  - ERROR: instance vm5.osuosl.org: instance not running on its primary node node3.osuosl.bak
  - ERROR: node node3.osuosl.bak: instance vm5.osuosl.org, connection to primary node failed
* Verifying orphan volumes
* Verifying orphan instances
* Verifying N+1 Memory redundancy
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node1.osuosl.bak fail
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node2.osuosl.bak fail
* Other Notes
 - WARNING: Communication failure to node node3.osuosl.bak: Error 7: Failed connect to 10.1.0.179:1811; Success
* Hooks Results
  - ERROR: node node3.osuosl.bak: Communication failure in hooks execution: Error 7: Failed connect to 10.1.0.179:1811; Success

That’s a lot of information to just say one of the nodes is offline. To summarize, this is what Ganeti is saying:

  • node1 & node2 can’t talk to node3
  • node3 isn’t responding to the master node
  • vm1, vm2 and vm3 have lost the connection to their secondary DRBD node (node3)
  • vm4 & vm5 aren’t running because their primary node (node3) is unreachable
  • node3 doesn’t have enough memory to deal with failovers (probably because Ganeti can’t see its resources)
  • the hooks couldn’t be run on node3 because of the same connection failure
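
If you’d rather not pick through that output by hand, the same summary can be pulled out with a little shell. This is only a rough sketch: summarize_verify is a made-up name (it isn’t part of Ganeti), and it assumes the error lines are worded exactly like the ones above, which may differ between Ganeti versions.

```shell
# Hypothetical helper, not part of Ganeti: reads `gnt-cluster verify` output
# on stdin and prints one "<role> <instance>" line for every instance that
# lost the connection to its primary or secondary node.
summarize_verify() {
    sed -n \
        -e 's/.*ERROR: node [^:]*: instance \([^,]*\), connection to primary node failed.*/primary \1/p' \
        -e 's/.*ERROR: node [^:]*: instance \([^,]*\), connection to secondary node failed.*/secondary \1/p' |
        sort -u
}
```

Usage would be something like "gnt-cluster verify | summarize_verify". Instances listed as primary need a failover; the ones listed as secondary only need new secondary storage.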

Needless to say, node3 is down. Now let’s mark node3 offline and see what Ganeti shows.

# gnt-node modify -O yes node3
 - WARNING: Communication failure to node node3.osuosl.bak: Error 7: Failed connect to 10.1.0.179:1811; Success
# gnt-cluster verify
* Verifying node status
* Verifying instance status
  - ERROR: instance vm1.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm2.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm3.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm4.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
  - ERROR: instance vm5.osuosl.org: instance lives on offline node(s) node3.osuosl.bak
* Verifying orphan volumes
* Verifying orphan instances
* Verifying N+1 Memory redundancy
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node1.osuosl.bak fail
  - ERROR: node node3.osuosl.bak: not enough memory on to accommodate failovers should peer node node2.osuosl.bak fail
* Other Notes
  - NOTICE: 1 offline node(s) found.
* Hooks Results

That’s much easier to read and handle. At this point I’m ready to fail over the instances that are offline.

# gnt-instance failover --ignore-consistency vm4
* checking disk consistency between source and target
* shutting down instance on source node
 - WARNING: Could not shutdown instance vm4.osuosl.org on node node3.osuosl.bak. Proceeding anyway. Please make sure node node3.osuosl.bak is down. Error details: Node is marked offline
* deactivating the instance's disks on source node
 - WARNING: Could not shutdown block device disk/0 on node node3.osuosl.bak: Node is marked offline
* activating the instance's disks on target node
 - WARNING: Could not prepare block device disk/0 on node node3.osuosl.bak (is_primary=False, pass=1): Node is marked offline
* starting the instance on the target node
# gnt-instance failover --ignore-consistency vm5
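
With only two instances this is easy to do by hand. On a bigger node, a small helper could generate the commands instead. A minimal sketch, assuming a made-up function name (failover_cmds) and input in the "name:primary node" form that gnt-instance list can produce; it only prints the commands so they can be reviewed before anything is run:

```shell
# Hypothetical helper, not part of Ganeti. Reads "instance:primary_node"
# lines on stdin and prints a failover command for every instance whose
# primary node is the dead node given as $1. Nothing is executed.
failover_cmds() {
    awk -F: -v dead="$1" \
        '$2 == dead { print "gnt-instance failover --ignore-consistency " $1 }'
}
```

Usage would be something like "gnt-instance list --no-headers --separator=: -o name,pnode | failover_cmds node3.osuosl.bak".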

Now let’s fix the secondary storage for the other instances.

# gnt-instance replace-disks -n node2 vm1
 - INFO: Old secondary node3.osuosl.bak is offline, automatically enabling early-release mode
Replacing disk(s) 0 for vm1.osuosl.org
STEP 1/6 Check device existence
 - INFO: Checking disk/0 on node1.osuosl.bak
 - INFO: Checking volume groups
STEP 2/6 Check peer consistency
 - INFO: Checking disk/0 consistency on node node1.osuosl.bak
STEP 3/6 Allocate new storage
 - INFO: Adding new local storage on node2.osuosl.bak for disk/0
STEP 4/6 Changing drbd configuration
 - INFO: activating a new drbd on node2.osuosl.bak for disk/0
 - INFO: Shutting down drbd for disk/0 on old node
 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline
      Hint: Please cleanup this device manually as soon as possible
 - INFO: Detaching primary drbds from the network (=> standalone)
 - INFO: Updating instance configuration
 - INFO: Attaching primary drbds to new secondary (standalone => connected)
STEP 5/6 Removing old storage
 - INFO: Remove logical volumes for 0
 - WARNING: Can't remove old LV: Node is marked offline
      Hint: remove unused LVs manually
 - WARNING: Can't remove old LV: Node is marked offline
      Hint: remove unused LVs manually
STEP 6/6 Sync devices
 - INFO: Waiting for instance vm1.osuosl.org to sync disks.
 - INFO: - device disk/0:  0.00% done, no time estimate
 - INFO: - device disk/0: 25.00% done, 2h 23m 24s remaining (estimated)
 - INFO: - device disk/0: 50.40% done, 47m 38s remaining (estimated)
 - INFO: - device disk/0: 76.40% done, 26m 46s remaining (estimated)
 - INFO: - device disk/0: 92.20% done, 7m 49s remaining (estimated)
 - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
 - INFO: Instance vm1.osuosl.org's disks are in sync.

If you add --submit, the job is sent to the background and the command returns right away. You can then view the output in real time by running gnt-job watch <job id>. I went ahead and told Ganeti to replace the secondary disks on the other two instances at the same time. Be careful about running too many replace-disks operations at once, as you may run into disk I/O issues on the nodes.

Now there is another way I could have fixed this that would have required fewer steps: gnt-node evacuate. This command allows you to move all the secondary storage from a single node to another node quickly instead of doing it VM by VM. The command probably would have looked something like this:
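
As a rough illustration of the --submit approach, the sketch below prints one backgrounded replace-disks command per affected instance. The function name replace_secondary_cmds is made up (it isn’t part of Ganeti), and the input is assumed to be "name:secondary node" lines such as gnt-instance list can produce. Like any dry run, it executes nothing.

```shell
# Hypothetical helper, not part of Ganeti. Reads "instance:secondary_node"
# lines on stdin and prints a backgrounded (--submit) replace-disks command
# for every instance whose secondary is the dead node.
#   $1 = dead node, $2 = node that should become the new secondary
replace_secondary_cmds() {
    awk -F: -v dead="$1" -v target="$2" \
        '$2 == dead { print "gnt-instance replace-disks --submit -n " target " " $1 }'
}
```

Usage would be something like "gnt-instance list --no-headers --separator=: -o name,snodes | replace_secondary_cmds node3.osuosl.bak node2". Make sure the target isn’t an instance’s own primary node, and run the printed commands a few at a time to keep the disk I/O manageable. Each job can then be followed with gnt-job watch.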

# gnt-node evacuate --force -n node2 node3 

Instead of specifying which node to migrate storage to, you can also use an IAllocator plugin to automatically pick which node to use. So the command above would have been:

# gnt-node evacuate --force -I hail node3 

After a few minutes I had brought redundancy back into my cluster and the instances back online, with no data loss.

Ganeti rocks!

After nearly a month and a half (42 days) of development since 0.4 was released, the OSUOSL has released Ganeti Web Manager 0.5 today. This second release includes some very nice new features; a couple of my favorites are described below.

Read the full ChangeLog for more details.

noVNC Console

My favorite new feature by far is the inclusion of noVNC by default for VNC console access. This removes the Java requirement for your browsers and makes it much easier to use. It works best in Chrome/Chromium, but you can also use Firefox.

New Overview Page

I’m also excited about the new overview pages for users and admins. It makes it much easier to see the usage of your cluster(s) quickly. For users it will show some basic resource/quota usage.

Upgrading

If you’re upgrading from 0.4 be sure to read the upgrading wiki page and go over the installation page again. We’ve added a few new requirements such as South for database migrations and Twisted for the new VNC Auth Proxy.

Be sure to check out Peter’s blog post about the 0.5 release as well!

For the last several months I’ve been trying to find a new hobby that is totally different than anything else I’ve been interested in. Like any geek I always find myself dinking around on my laptop at night so it makes me wonder what I’m doing with my brain looking at a computer screen all day. Thankfully I do enjoy another hobby with music and playing my trumpet in a local jazz band. It fills a specific kind of a void but I wanted to have something new to try.

My dad recently told me he decided to get into radio-controlled flying and had gotten a couple of helicopters and a simple plane from a hobby store in Topeka, Kansas. After talking to him it suddenly triggered childhood memories of going to RC airplane shows and wanting to get into the hobby at some point. After being so involved with band and computers for so long I had forgotten that dream … until now.

A few weeks ago I decided to buy a coaxial helicopter from a local hobby store, Trump Hobbies. It’s a Blade mCX2 and cost around $135, which includes a controller and batteries. It’s a really easy micro-helicopter to fly around the apartment and was the catalyst for me getting re-interested in RC. I eventually bought a new carbon fiber tail fin to make it perform better, but I knew I could have more fun.

About two weeks later I decided to “upgrade” to a single-rotor helicopter and get something bigger as well. I ended up getting the Blade 120 SR ($150), which is twice the size of the Blade mCX2. I took it out for a spin and soon discovered how much more difficult a single-rotor heli can be. Since the heli was a little too big for me to fly around the apartment (at the time), I decided to get the Blade mSR ($99) the next day, which is the same size as the mCX2 but with a single-rotor design.

The nice thing about getting the mSR is that it came with a 4-port battery charger with an AC adapter, so now I could charge more than one battery at a time. I have enough spare batteries that I can fly the mSR almost indefinitely (which may or may not be a good thing). I have four 500mAh batteries for the 120 SR, which gives me about 45 minutes of fly time.

My heli fleet thus far. (left to right: Blade mCX2, Blade 120 SR, Blade mSR)

That brought my “fleet” up to three aircraft. I’ve been flying the mSR around the apartment and really starting to get the hang of it. It’s also been helping me train for flying the larger 120 SR. I’ve been trying to make it to a local park after work to get an hour or so of fly time with the 120 SR and mSR. I’m lucky that Oregon is blessed with low winds in the valley generally so I can enjoy these helicopters more. So far I haven’t crashed them to the point of having to do major repair on them (knock on wood). I’m really impressed with the durability of the helis, especially the mSR. I’m also impressed with their performance capability. The only problem I’ve had is losing rotor linkages on the 120 SR but thankfully I have enough spares to make it through a session.

So far Miles (my cat) has enjoyed watching me fly the helis around the apartment, but he doesn’t like it when I try flying closely around him. Imagine that! Anyways, I’ll try blogging about this newfound hobby of mine when I can. I’ve already made a wish list which includes several airplanes and a better transmitter. This new hobby is certainly a nice escape from work.

Want to work at the coolest place for open source and support the missions of some of the most important open source projects?

Oregon State University’s Open Source Lab is recruiting a full-time software developer who will analyze, design, and test software code for Ganeti Web Manager, the Protein Geometry Database and several other homegrown Open Source Lab projects. Development at the OSUOSL includes collaborations with academic and research faculty internal and external to OSU.

Reporting to the Operations Manager of the Open Source Lab, the Analyst Programmer will contribute in-depth knowledge of open source software development using languages such as Python, Ruby and Java. The person in this position is responsible for developing and modifying complex software applications, documenting code and development processes, and overseeing student software developers. This position will allow the candidate to interact with many of the open source projects hosted by the OSL. We seek candidates with a high level of initiative, motivation, and a high degree of success in previous endeavors.

To review a more detailed job description and apply, check out the Analyst Programmer role on Oregon State University’s Jobs page.