Recovering RHEV Virtual Guests in “Unknown” status after losing access to storage

The last thing you want in a clustered virtualization environment is to lose access to your storage. No storage means no virtual server.

A former RHEV client of mine had a scenario where their storage somehow became inaccessible to all the hypervisors in the cluster. They power cycled the RHEV hypervisors and even rebooted the RHEV-M management server a few times, but no matter what they did, all the virtual servers stayed offline while the RHEV-M web interface still reported them as being in an “Unknown” state. Not up, where you could power them off, and not down, where you could power them up, but “Unknown”, where you basically can’t do anything.

Without a doubt, the first thing you should do in this case is phone Red Hat support and let them know what’s going on. They will know the best way to proceed.

I’d like to thank ‘eprasad’ on the #rhev channel on FreeNode for the assistance with this issue.

I have just had the same problem as my former client. All my virtual guests were in an unknown state after my storage domain became inaccessible.

I am using RHEV 3.1, with RHEV-H hypervisors connected to NFS storage.

Basically, the ins and outs of this problem seem to be that the RHEV-M database is left in a state of limbo after losing the storage, and a bit of manual intervention is required.

Again, I stress you should be speaking with Red Hat before doing this.

 

If you are in this exact scenario, you can manipulate the backend RHEV-M database and manually set the virtual guests to a down state.

To do this, log in to a shell on the RHEV-M server and connect to the local PostgreSQL database.

[root@rhevm ~]# psql -U engine 
Password for user engine: 
psql (8.4.13)
Type "help" for help.

engine=>

Once you’ve connected to the psql shell, we need to find the VM GUID that is associated with our virtual guest.

In this example, we will use the virtual guest “server01-example-com”.

To grab the VM GUID, run the following.

engine=> select vm_guid from vm_static where vm_name='server01-example-com';
               vm_guid                
--------------------------------------
 df13ca7f-b6fa-4820-8a7b-28d785221b64
(1 row)
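Since the GUID is copied by hand into the next statement, a quick check that it parses as a complete UUID can catch a truncated paste. The following is a minimal sketch using only Python's standard library; the helper name `is_valid_guid` is hypothetical and is not part of RHEV or its tooling.

```python
import uuid


def is_valid_guid(text: str) -> bool:
    """Return True if text is a complete, hyphenated UUID such as psql prints.

    Hypothetical helper for illustration only; not part of RHEV.
    """
    candidate = text.strip()  # psql pads the value with a leading space
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return False
    # uuid.UUID also accepts un-hyphenated hex and braces, so compare against
    # the canonical hyphenated form to match what the query returned.
    return str(parsed) == candidate.lower()


if __name__ == "__main__":
    print(is_valid_guid("df13ca7f-b6fa-4820-8a7b-28d785221b64"))
```

A value that fails this check is worth re-copying from the query output before it goes anywhere near an UPDATE.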

 

Now that we have the GUID, copy it, as we will use it in our next command.

Here we set the status of our virtual guest to “0” in the vm_dynamic table. The “0” represents a status of “down”.

engine=> UPDATE vm_dynamic SET status=0 WHERE vm_guid='df13ca7f-b6fa-4820-8a7b-28d785221b64';
UPDATE 1
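The lookup and update can be rehearsed against a throwaway database before touching the live one. The sketch below rests on several assumptions: it uses Python's `sqlite3` module with an in-memory database rather than the real PostgreSQL engine database, the mock schema contains only the two tables and three columns named in this article, the starting status value is an arbitrary placeholder, and the function names are hypothetical. It shows one cautious pattern: only commit when exactly one row was changed.

```python
import sqlite3

GUID = "df13ca7f-b6fa-4820-8a7b-28d785221b64"
# Placeholder for "some status other than down"; the real numeric code for
# "Unknown" is not given in this article, so this value is an assumption.
PLACEHOLDER_STATUS = 7


def build_mock_engine_db() -> sqlite3.Connection:
    """Create an in-memory stand-in holding only the columns used above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vm_static (vm_guid TEXT PRIMARY KEY, vm_name TEXT)")
    conn.execute("CREATE TABLE vm_dynamic (vm_guid TEXT PRIMARY KEY, status INTEGER)")
    conn.execute("INSERT INTO vm_static VALUES (?, ?)", (GUID, "server01-example-com"))
    conn.execute("INSERT INTO vm_dynamic VALUES (?, ?)", (GUID, PLACEHOLDER_STATUS))
    conn.commit()
    return conn


def set_vm_down(conn: sqlite3.Connection, vm_name: str) -> int:
    """Look up the GUID by name and set status to 0; return rows changed."""
    row = conn.execute(
        "SELECT vm_guid FROM vm_static WHERE vm_name = ?", (vm_name,)
    ).fetchone()
    if row is None:
        return 0  # no such guest, nothing to update
    cur = conn.execute(
        "UPDATE vm_dynamic SET status = 0 WHERE vm_guid = ?", (row[0],)
    )
    if cur.rowcount != 1:
        conn.rollback()  # unexpected row count: undo rather than guess
        return 0
    conn.commit()
    return 1


if __name__ == "__main__":
    db = build_mock_engine_db()
    print(set_vm_down(db, "server01-example-com"))
```

Against the real engine database the same caution can be applied by hand: check that psql reports exactly `UPDATE 1`, as in the output above, before moving on.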

 

You should now see in the RHEV-M web interface that the status of your affected virtual guest is “down”. If your storage and hypervisors are back online, you will now be able to power on your virtual guest and let it perform a normal boot.

 

7 comments on “Recovering RHEV Virtual Guests in “Unknown” status after losing access to storage”

  1. Brandon Saccone March 17, 2013 17:49

    Dale, excellent article. Something similar has happened to me whilst trying to upgrade RHEVM 3.0 to 3.1. I reverted back to 3.0, but I now have the hosts (hypervisors) in an unknown state. Would the process you suggest also resolve a host unknown state? The thing is that the VMs on that host are held within that host, and I am worried that re-adding the host to RHEVM will wipe all data, including the VMs within. Can you suggest a way forward?

    • Dale Macartney March 17, 2013 18:15

      Hi Brandon

      The first thing I’d say is ring Red Hat support. If your entire cluster is offline, then raise a Priority 1 ticket.

      I have seen hypervisors in unknown states before (normally on HP blades, as they take a very long time to boot), but they eventually come back to an online state a few minutes after boot.

      Do you have any hosts currently online in the cluster? I’d try doing a hard power cycle on the other hypervisors to give them a good kicking. Show them who’s boss. Your VMs will be safe on the SAN.

      This article only covers changing the state of an offline VM in the RHEVM database. It is completely different to the hypervisors.

      Red Hat Support would definitely be your best source of information.

      Hope that helps.

      Dale

  2. Brandon Saccone March 17, 2013 19:08

    Thanks, Dale, for replying so quickly. I was able to recover one host (one cluster) whose VMs are on a SAN. My problem is with the other host in an unknown state, which keeps its VMs in local storage (so it is not in the same cluster), i.e. no SAN. I have opened a priority 2 case with Support, as the VMs are currently running on the host; it’s just that RHEVM has lost its ability to communicate with it, I guess.

    • Dale Macartney March 17, 2013 19:21

      Hi Brandon. You’re welcome, I’m always happy to help.

      It’s great to hear your VMs are still online. It would be worthwhile getting the discussions going with the RHEV team early, as it might be something you can resolve online with no downtime.

      Good luck.

  3. JD May 20, 2014 19:06

    You sir saved me several hours waiting on RH to answer a simple question. If you had a tip jar I’d buy you a beer as this worked like a champ on my 3.3 RHEV environment.

  4. Pankaj Purohit June 4, 2014 07:18

    Hi

    I have tried psql -U engine, but it is not accepting the password. Please help me with how I can proceed to change the state. When I tried to reset the password for user engine in psql, it reported that the role is not defined.

    Regards,
    pankaj

  5. John Hardy June 13, 2014 03:46

    Great response to something that I have found to be common: when you don’t shut your laptop down properly whilst running nested virtualization of RHEV, you quite often get VMs in an unknown state. Most people point to stripping the VM directly from the Postgres DB, which is certainly NOT advised. Your method gives control back to the manager, so you can use the supported method of removal, which is the UI. Thanks
