Boot ESX 4 from SAN on a failover LUN – Lessons learned.

Is it possible to boot an ESX 4 host from a LUN that has been “failed over” to another physical location? Yes, it is possible, but there are some serious caveats. Consider the following scenario:

  • 2 datacenters: site A production (active), site B failover (passive)
  • In each site, an identical SAN array; LUN mirroring between the two arrays over a dedicated Fibre Channel link
  • In each site 2 identical ESX hosts and both hosts boot from the SAN array
  • The vCenter Server is virtualized

After the ESX boot LUNs have been migrated to site B and you try to boot an ESX host from a migrated LUN, the boot process will probably stop at vsd-mount and the ESX host will drop into troubleshooting mode. This happens because the ESX host recognizes the boot LUN as a snapshot. The issue can be solved as described in VMware KB article 1012142. So, let’s get started 🙂

  1. In troubleshooting mode, enable resignaturing on the ESX host by typing: #esxcfg-advcfg -s 1 /LVM/EnableResignature
  2. Unload the VMFS driver: #vmkload_mod -u vmfs3
  3. Load the VMFS driver again: #vmkload_mod vmfs3
  4. Detect and resignature the VMFS volumes by typing: #vmkfstools -V
  5. Now, find the path to the esxconsole.vmdk file by typing: #find /vmfs/volumes/ -name esxconsole.vmdk
  5. Now, find the path to the esxconsole.vmdk file by typing: #find /vmfs/volumes/ -name esxconsole.vmdk
  6. The output should look similar to the following example: /vmfs/volumes/4c7e41bc-acb14e48-eeb9-e61f137cb50f/esxconsole-4c57f62f-72a6-8e68-1e35-e41f1378a8e0/esxconsole.vmdk
  7. Make a note of this output. You will need it later.
  8. Reboot the ESX host.
  9. Wait until the host boots up and you see the GRUB menu.
  10. Highlight the “VMware ESX 4.0” entry and press “e”.
  11. Select the “kernel /vmlinuz” line, press “e” again, and append a space followed by /boot/cosvmdk= and the full path to the esxconsole.vmdk file noted in step 6. It should look similar to the following example: quiet /boot/cosvmdk=/vmfs/volumes/4c7e41bc-acb14e48-eeb9-e61f137cb50f/esxconsole-4c57f62f-72a6-8e68-1e35-e41f1378a8e0/esxconsole.vmdk
  12. Press Enter to accept the changes and press the “b” button to start the boot process. The ESX host should start successfully.
  13. Next, log in to the service console with the root user account and edit the esx.conf file located in the /etc/vmware directory: #vi /etc/vmware/esx.conf
  14. Press Insert key, scroll down to the /adv/Misc/CosCorefile entry and change the path to the one noted in step 6. It should look similar to: /adv/Misc/CosCorefile = “/vmfs/volumes/4c7e41bc-acb14e48-eeb9-e61f137cb50f/esxconsole-4c57f62f-72a6-8e68-1e35-e41f1378a8e0/core-dumps/cos-core”
  15. Scroll down to the /boot/cosvmdk entry and change the path to the one noted in step 6. The entry should read similar to: /boot/cosvmdk = “/vmfs/volumes/4c7e41bc-acb14e48-eeb9-e61f137cb50f/esxconsole-4c57f62f-72a6-8e68-1e35-e41f1378a8e0/esxconsole.vmdk”
  16. Press the ESC key and type: :wq
  17. Press Enter to save the changes to the esx.conf file.
  18. Save the changes made to the boot configuration by typing: #esxcfg-boot -b
  19. Reboot the ESX host.
  20. Repeat steps 1 to 19 on every ESX host.
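If you have to repeat this on many hosts, steps 14 and 15 can also be done non-interactively with sed instead of vi. The sketch below is illustrative only: the UUID path is the example value from step 6, and it edits a local stand-in file rather than the real /etc/vmware/esx.conf, so adapt the paths before using it on a live host.

```shell
# Path to the resignatured volume, as found in step 6
# (this UUID is the example value from the post, not a real system).
NEWPATH='/vmfs/volumes/4c7e41bc-acb14e48-eeb9-e61f137cb50f/esxconsole-4c57f62f-72a6-8e68-1e35-e41f1378a8e0'

# Local stand-in for /etc/vmware/esx.conf, so this sketch is self-contained.
CONF=esx.conf.sample
cat > "$CONF" <<'EOF'
/adv/Misc/CosCorefile = "/vmfs/volumes/old-uuid/old/core-dumps/cos-core"
/boot/cosvmdk = "/vmfs/volumes/old-uuid/old/esxconsole.vmdk"
EOF

# Back up before editing, as with any config file.
cp "$CONF" "$CONF.bak"

# Rewrite both entries to point at the resignatured volume. The | delimiter
# avoids escaping all the slashes in the path.
sed -i "s|^/adv/Misc/CosCorefile = .*|/adv/Misc/CosCorefile = \"$NEWPATH/core-dumps/cos-core\"|" "$CONF"
sed -i "s|^/boot/cosvmdk = .*|/boot/cosvmdk = \"$NEWPATH/esxconsole.vmdk\"|" "$CONF"
```

On a real host you would point CONF at /etc/vmware/esx.conf and still follow with #esxcfg-boot -b and a reboot, exactly as in steps 18 and 19.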

OK, the ESX part is done. The hosts should now boot without any problems. Next, let’s bring the vCenter Server VM up and running.

  1. Log in to one of the ESX hosts with the vSphere Client.
  2. Locate the vCenter Server VM on the datastore (if the datastore appears under a snapshot name, simply rename it back to the correct name).
  3. Add the vCenter Server VM to the inventory (if the vCenter Server VM has multiple disks located on different datastores, remove and re-add those disks to the VM).
  4. Check if the Network Adapter of the VM is connected to the correct network.
  5. Power on the VM.

Now that the virtual infrastructure is operational, you can restore the production VMs. I’ve tested this procedure with 2 ESX hosts and 4 VMs. It took me about 3 hours due to reboots and restore operations. Imagine doing this with 8 ESX hosts and 60 VMs… you get the picture, right? 🙂

Conclusion

Buy VMware Site Recovery Manager (SRM)! Or configure an Active/Active infrastructure.

Cheers!

– Marek.Z

6 Comments

  1. Hi!
    Many thanks!
    Followed your guide, but have one problem:
    After the operation, the ESX boot LUN name in vCenter has changed. It’s snap-xxxxxxxx- (in Configuration > Storage > Identification).
    As far as I can tell, everything else is normal.
    Is it really a snap shot? How can I tell?
    Or is it just a leftover from the earlier problem?
    What should I do?

    • Hi Anders,

      In the case of a LUN that has been replicated to another SAN, this is normal behavior in vSphere 4.x: there was a change to the disk, or to the controller connecting to the disk, on which ESX(i) 4.x was previously installed. That’s why the disk is presented as a snapshot. You could resignature the disk and rename it, but you would have to repeat the whole procedure described in this blog post.

      Hope this helps.

      Cheers!

      – Marek.Z
