<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Lance Albertson &#187; power outage</title>
	<atom:link href="http://www.lancealbertson.com/tag/power-outage/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lancealbertson.com</link>
	<description>Musings of a geek, jazz performer, and an OSUOSL sysadmin</description>
	<lastBuildDate>Tue, 03 May 2011 06:09:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Power Outage: A true test for Ganeti</title>
		<link>http://www.lancealbertson.com/2010/05/power-outage-a-true-test-for-ganeti/</link>
		<comments>http://www.lancealbertson.com/2010/05/power-outage-a-true-test-for-ganeti/#comments</comments>
		<pubDate>Fri, 21 May 2010 00:06:09 +0000</pubDate>
		<dc:creator>lance</dc:creator>
				<category><![CDATA[gentoo]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[drbd]]></category>
		<category><![CDATA[ganeti]]></category>
		<category><![CDATA[kvm]]></category>
		<category><![CDATA[power outage]]></category>

		<guid isPermaLink="false">http://www.lancealbertson.com/?p=169</guid>
		<description><![CDATA[Nothing like a power outage gone wrong to test a new virtualization cluster. Last night we lost power in most of Corvallis and our UPS &#38; Generator functioned properly in the machine room. However we had an unfortunate sequence of issues that caused some of our machines to go down, including all four of our [...]]]></description>
			<content:encoded><![CDATA[<p>Nothing like a power outage gone wrong to test a new virtualization cluster. Last night we lost power in most of Corvallis and our UPS &amp; Generator functioned properly in the machine room. However, we had an unfortunate sequence of issues that caused some of our machines to go down hard, including all <strong>four</strong> of our ganeti nodes hosting <strong>62 virtual machines</strong>. If this had happened with our old xen cluster with iSCSI, it would have taken us over an hour to get the infrastructure back in a normal state by manually restarting each VM.</p>
<p>But when I checked the <a href="http://code.google.com/p/ganeti/">ganeti</a> cluster shortly after the outage, I noticed that all four nodes had rebooted without any issues and the master node was <strong>already</strong> rebooting virtual machines <strong>automatically</strong> and fixing all of the <a href="http://www.drbd.org/">DRBD</a> block devices. Ganeti has a nice app called <strong><code>ganeti-watcher</code></strong> which is run every five minutes via cron. It currently has two primary functions (taken from <code>ganeti-watcher(8)</code>):</p>
<ol>
<li>Keep running all instances as marked (i.e. if they were running, restart them)</li>
<li>Repair DRBD links by reactivating the block devices of instances which have secondaries on nodes that have rebooted.</li>
</ol>
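<p>To make those two functions concrete, here is a minimal Python sketch of the decision the watcher makes on each run. It is only an illustration and not Ganeti&#8217;s actual code: the function name <code>plan_watcher_actions</code>, the dictionary keys, and the node names used with it are all hypothetical.</p>
<pre class="brush: python; title: ; notranslate">
def plan_watcher_actions(instances, rebooted_nodes):
    """Return (to_restart, to_reactivate) lists of instance names.

    Hypothetical sketch: each instance is a dict with the keys
    "name", "marked_up", "running" and "secondary" (a node name).
    """
    to_restart = []
    to_reactivate = []
    for inst in instances:
        # Function 1: an instance marked as running that is down gets restarted.
        if inst["marked_up"] and not inst["running"]:
            to_restart.append(inst["name"])
        # Function 2: disks are reactivated when the secondary node has rebooted.
        if inst["secondary"] in rebooted_nodes:
            to_reactivate.append(inst["name"])
    return to_restart, to_reactivate
</pre>
<p>Because cron runs the check every five minutes, anything that fails to come up on one pass is simply picked up again on the next one.</p>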
<p>The watcher app took around 30 minutes to bring all 62 VMs back online. The load on most of the nodes didn&#8217;t go over 4 during the recovery, which is quite impressive considering how much I/O it&#8217;s doing while VMs are booting. Normally the nodes have loads between 0.3 and 0.5. Only 3 VMs didn&#8217;t boot cleanly, because of incorrect fstab entries or incorrect kernel path settings in ganeti, and those were easy to fix. I was surprised we didn&#8217;t have more issues like that.</p>
<p>While ganeti is bringing instances back online you can tail the watcher log, generally at <code>/var/log/ganeti/watcher.log</code>, which will show output similar to this:</p>
<pre class="brush: plain; title: ; notranslate">
2010-05-20 04:06:25,077:  pid=10202 INFO Restarting busybox.osuosl.org (Attempt #1)
2010-05-20 04:07:16,311:  pid=10202 INFO Restarting driverdev.osuosl.org (Attempt #1)
2010-05-20 04:07:18,346:  pid=10202 INFO Restarting pcc.osuosl.org (Attempt #1)
</pre>
<p>And once it&#8217;s finished, it will show output like this:</p>
<pre class="brush: plain; title: ; notranslate">
2010-05-20 04:35:04,066:  pid=22741 INFO Restart of busybox.osuosl.org succeeded
2010-05-20 04:35:04,066:  pid=22741 INFO Restart of driverdev.osuosl.org succeeded
2010-05-20 04:35:04,066:  pid=22741 INFO Restart of pcc.osuosl.org succeeded
</pre>
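<p>If you want numbers rather than eyeballing the log, something like the following Python sketch can pair each &#8220;Restarting&#8221; line with its &#8220;Restart of &#8230; succeeded&#8221; line. It assumes the log format is exactly what is quoted above; the function name <code>restart_durations</code> is my own and not part of Ganeti. Note that the success lines are written by a later watcher run, so the result is the time until the log confirmed the restart, not the VM&#8217;s actual boot time.</p>
<pre class="brush: python; title: ; notranslate">
import re
from datetime import datetime

# Matches the two kinds of watcher.log lines quoted above (assumed format).
LINE_RE = re.compile(r"^(\S+ \S+):\s+pid=\d+ INFO (Restarting|Restart of) (\S+)")


def restart_durations(lines):
    """Map instance name to seconds between first attempt and logged success."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a restart-related line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        action, name = match.group(2), match.group(3)
        if action == "Restarting":
            # Keep only the first attempt for each instance.
            started.setdefault(name, stamp)
        elif name in started:
            durations[name] = (stamp - started[name]).total_seconds()
    return durations
</pre>
<p>Fed the six lines quoted above, this reports between roughly 27.7 and 28.7 minutes for each of the three instances, which lines up with the 30 minutes or so the whole recovery took.</p>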
<p>It was great watching this system recover everything automatically, quickly, and with few issues. Needless to say, outages are a bad thing and it&#8217;s our fault that our cluster went down like this, but it was great seeing this system work nearly flawlessly. We&#8217;ll soon fix the power situation for our cluster so this shouldn&#8217;t happen again.</p>
<p>Take that, ESX ;-)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lancealbertson.com/2010/05/power-outage-a-true-test-for-ganeti/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

