Tech Talks - OS, Storage, Backup and Virtualization

Sunday, December 29, 2019

Horizon View - "Failed to connect to Connection Server" when accessed via LB WIP or DNS alias

Scenario

Horizon clients fail with "Failed to connect to Connection Server" when they connect through the LB WIP or a DNS alias.

Connecting directly with a Connection Server's FQDN works fine.

Solution

If you are facing this after upgrading to View 7.x, you are not alone! And it is not a bug.

It is a new security feature in 7.x (the Connection Server now checks the origin of incoming requests), and it can be disabled with the steps in this KB.

https://kb.vmware.com/s/article/2144768

All you need to do is

> Create a file named locked.properties (or edit the existing one if it is already there)

> Add the line "checkOrigin=false" (without quotes)

> Save it and copy it to the C:\Program Files\VMware\VMware View\Server\sslgateway\conf folder on all your Connection Servers.

> Restart the Connection Server service (or reboot the server), one server at a time like you normally do
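
If you have many Connection Servers, the file edit can be scripted. Below is a minimal Python sketch, not something from the KB: the function name ensure_check_origin_disabled is my own, and it assumes you pass it the conf folder path from the step above. It keeps any other settings already present in locked.properties and only replaces the checkOrigin line.

```python
from pathlib import Path

SETTING_KEY = "checkOrigin"
SETTING_LINE = "checkOrigin=false"


def ensure_check_origin_disabled(conf_dir):
    """Create or update locked.properties in conf_dir so it contains checkOrigin=false.

    Other lines already in the file are preserved; an existing checkOrigin
    line is replaced. Returns the path of the file that was written.
    """
    path = Path(conf_dir) / "locked.properties"
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop any previous checkOrigin entry, whatever its value was.
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() != SETTING_KEY]
    kept.append(SETTING_LINE)
    path.write_text("\n".join(kept) + "\n")
    return path
```

The service restart is deliberately left out of the sketch, so you can still do that one server at a time.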

Thursday, December 12, 2019

VMware Horizon View 7.X desktop “Agent unreachable” status

Scenario

A VDI user reported that he could not connect to his VDI machine.

The View Admin page showed this VDI machine with the status "Agent unreachable".

First things first

1) Checked vCenter and made sure that the VM was up and running, not powered off or suspended.

2) I could RDP to it, and checked the services

3) Restarted the Agent service. No luck.

4) Rebooted the VM. No luck there either.

At this point I started looking at the logs

C:\ProgramData\VMware\VDM\logs

debug-2019-12-12-150326.txt

2019-12-12T15:03:35.940+10:00 DEBUG (25A0-26C4) <Thread-4> [AgentMessageSecurityHandler] Configuring message security (ON).
2019-12-12T15:03:36.033+10:00 DEBUG (25A0-26C4) <Thread-4> [BrokerUpdateUtility] Published CHANGEKEY request
2019-12-12T15:03:51.035+10:00 DEBUG (25A0-26C4) <Thread-4> [BrokerUpdateUtility] Timeout waiting for success response

So it looks like the agent was trying to change its key but wasn't successful (the CHANGEKEY request timed out after 15 seconds). So I decided to push it from the Connection Server instead.
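
If you need to check many agent logs for the same pattern, a small script helps. This is a rough Python sketch that only assumes the line layout shown in the excerpt above (timestamp, level, an id in parentheses, thread name, component in square brackets); find_changekey_timeouts is a name I made up, not a VMware tool.

```python
import re
from datetime import datetime

# Matches BrokerUpdateUtility lines laid out like the excerpt above.
LINE_RE = re.compile(
    r"^(\S+) \w+ \(([^)]+)\) <[^>]+> \[BrokerUpdateUtility\] (.*)$"
)


def find_changekey_timeouts(lines):
    """Return (id, seconds_waited) for every CHANGEKEY request that timed out."""
    pending = {}   # id in parentheses -> time the CHANGEKEY request was published
    timeouts = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, ident, message = match.groups()
        when = datetime.fromisoformat(stamp)
        if message.startswith("Published CHANGEKEY request"):
            pending[ident] = when
        elif message.startswith("Timeout waiting for success response") and ident in pending:
            waited = (when - pending.pop(ident)).total_seconds()
            timeouts.append((ident, waited))
    return timeouts
```

Feed it the lines of a debug-*.txt file; an empty list means no CHANGEKEY request in that file ended in a timeout.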

1) Logged in to one of our View Connection Servers

2) Opened a CMD prompt as Administrator

3) Ran the commands below

cd "C:\Program Files\VMware\VMware View\Server\tools\bin"

vdmadmin -A -d <Name of the Desktop Pool> -m <Machine Name> -resetkey

4) You should see the Agent public key listed in the output, and that's all good.

5) Wait a few minutes and you should see the machine reporting its status again :)

It reported "Unassigned User Connected" because the machine was assigned to someone else and I had logged in to it with an admin ID. So I rebooted the VM, and all was good afterwards.

Update - 20/03/2020

We have seen the same issue when a few users installed Docker, which enabled the Hyper-V feature as part of its setup!

Removed the Hyper-V feature from Add/Remove Programs -> Turn Windows features on or off

Rebooted the VDI machine and that brought the Agent back online

Tuesday, December 10, 2019

vCenter SSO User password Expired

We had a vCenter SSO user created for SRM and its password expired. Here is how you can check it and fix it.

User name is srm@vsphere.local

1) Log in to the VCSA over SSH and run the commands below

root@vcenterserver [ ~ ]# cd /usr/lib/vmware-vmafd/bin/

root@vcenterserver [ /usr/lib/vmware-vmafd/bin ]# ./dir-cli user find-by-name --account srm --level 2
Enter password for administrator@vsphere.local:
Account: srm
UPN: srm@VSPHERE.LOCAL
Account disabled: FALSE
Account locked: FALSE
Password never expires: FALSE
Password expired: TRUE

root@vcenterserver [ /usr/lib/vmware-vmafd/bin ]# ./dir-cli user modify --account srm --password-never-expires
Enter password for administrator@vsphere.local:
Password set to never expire for [srm].

root@vcenterserver [ /usr/lib/vmware-vmafd/bin ]# ./dir-cli password reset --account srm --password XXXXXXXX
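
If you have several service accounts to check, the find-by-name output is easy to parse. A minimal Python sketch follows; it assumes the "Key: value" layout shown above, and parse_account_status and needs_attention are hypothetical helper names of mine, not part of dir-cli.

```python
def parse_account_status(output):
    """Turn 'Key: value' lines from the dir-cli output into a dict.

    TRUE/FALSE values become booleans; lines without a value (such as the
    password prompt) are skipped.
    """
    status = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue
        if value in ("TRUE", "FALSE"):
            status[key.strip()] = value == "TRUE"
        else:
            status[key.strip()] = value
    return status


def needs_attention(status):
    """List the flags that usually explain a failing service account."""
    flags = ("Account disabled", "Account locked", "Password expired")
    return [flag for flag in flags if status.get(flag) is True]
```

For the srm account above, needs_attention would return only "Password expired".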

Tuesday, November 19, 2019

applmgmt service won't start on PSC appliance after converge operation

Scenario

We had a vCenter with an external PSC. We converged them, and the converge job was successful except for a cert warning.
After a week we tried to decommission the old PSC appliance and found that its status was shown in the Web Client as "Unknown".

On checking, we found applmgmt in a stopped state. We tried to start it, but it failed with the error below

[ ~ ]# service-control --status
Running:
lwsmd pschealth vmafdd vmcad vmdird vmdnsd vmonapi vmware-analytics vmware-certificatemanagement vmware-cis-license vmware-cm vmware-rhttpproxy vmware-sca vmware-sts-idmd vmware-stsd vmware-vapi-endpoint vmware-vmon
Stopped:
applmgmt vmware-statsmonitor

[ ~ ]# service-control --start applmgmt
Operation not cancellable. Please wait for it to finish...
Performing start operation on service applmgmt...
Error executing start on service applmgmt. Details {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}
Service-control failed. Error: {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}

We had this issue on two infrastructures, and the fix below resolved it on only one of them

FIX that worked on the first PSC

# List all masked units (symlinks pointing to /dev/null) that are candidates for removal.
find /etc/systemd/system/ -lname '/dev/null' -exec ls {} \;

# Automatically remove them (or rm each file)
find /etc/systemd/system/ -lname '/dev/null' -exec rm {} \;

# Reload systemctl daemon
systemctl daemon-reload

# Start services or Reboot
service-control --start --all
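
If you would rather review the list before deleting anything, the first find command can be mirrored in a few lines of Python. This is only a sketch: find_masked_units is my own name, it lists and does not delete, and it assumes the same /etc/systemd/system location used above.

```python
import os


def find_masked_units(unit_dir="/etc/systemd/system"):
    """Return sorted paths under unit_dir that are symlinks pointing to /dev/null."""
    masked = []
    for root, dirs, files in os.walk(unit_dir):
        # Symlinks can show up in either list, so check both.
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path) and os.readlink(path) == "/dev/null":
                masked.append(path)
    return sorted(masked)
```

Seeing applmgmt in that list would explain why service-control could not start it.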

However, the second PSC was still not happy, so we had to remove the replication manually

Manual removal of the replication

1) Shut down both the PSC and the vCenter and take an offline snapshot
2) Power on only the vCenter. Do not start the PSC
3) SSH to the vCenter and run the commands below

a) List all PSCs connected
]# ./vdcrepadmin -f showservers -h localhost -u administrator -w XXXX
cn=oldpscappliance.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=vcenter.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local

Note -- XXXX is the SSO password for administrator@vsphere.local

I could now see two servers: the old PSC appliance and the vCenter with the PSC converged into it.
I ran the command below to make sure the vCenter was pointing to the converged PSC and not to the old appliance

]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost
https://vcenter.mydomain.com:443/lookupservice/sdk
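
The comparison between the two outputs above can also be scripted. A small Python sketch, assuming the showservers output is one DN per line as shown and the lookup service location is a normal URL; server_names and unused_servers are names I picked for illustration.

```python
from urllib.parse import urlparse


def server_names(showservers_output):
    """Extract the host name (first cn= value) from each server DN."""
    names = []
    for line in showservers_output.splitlines():
        line = line.strip()
        lowered = line.lower()
        if lowered.startswith("cn=") and ",cn=servers," in lowered:
            names.append(line.split(",", 1)[0][3:])
    return names


def unused_servers(showservers_output, ls_location_url):
    """Return the servers that are NOT the one the lookup service URL points to."""
    in_use = (urlparse(ls_location_url).hostname or "").lower()
    return [n for n in server_names(showservers_output) if n.lower() != in_use]
```

With the two outputs above, unused_servers returns only the old PSC appliance, which is the node to unregister.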

The output confirmed that the old PSC appliance was not in use, so I decided to remove the association manually.

# /bin/cmsso-util unregister --node-pnid oldpscappliance.mydomain.com --username administrator --passwd XXXX

Watch the output. It ends like this

2019-11-12T08:29:24.939Z Running command: ['/usr/lib/vmware-vmafd/bin/dir-cli', 'service', 'list', '--login', 'administrator']
2019-11-12T08:29:25.059Z Done running command
Stopping all the services ...
All services stopped.
Starting all the services ...
Started all the services.
Success
2019-11-12T08:33:13.071Z Running command: ['/usr/bin/sed', '-i', '-e', 's/cmsso-util.*/cmsso-util/g', '/var/log/vmware/procstate']

2019-11-12T08:33:13.829Z Done running command

Log in to the vCenter via the Web Client and, under Administration -> System Configuration, make sure that the old PSC is not listed anymore.

You may keep the old PSC appliance for a few days and delete it once all is good.

Thursday, November 14, 2019

vMotion failing at 21% with error "The vMotion failed because the destination host did not receive data from the source host on the vMotion network. Please check your vMotion network settings and physical network configuration and ensure they are correct."


Scenario

We built a new ESXi 6.7 Cluster and we couldn't make vMotions work there

Troubleshooting steps

1) Make sure there is no IP address conflict.

2) SSH to ESXi and do a VMK Ping check

For Default TCP/IP Stack

vmkping 10.11.7.188

Or, if you are using multi-NIC vMotion, you might want to specify which VMK interface to use
vmkping -I vmk<vmkinterfacenumber> <Destination VMK IP to Ping>

Eg: vmkping -I vmk3 192.168.1.1

For vMotion Stack

If you try the above vmkping command on an ESXi host whose VMK interfaces are on the vMotion TCP/IP stack, it will fail with an error
Unknown interface 'vmk': Invalid argument
This is because the command looks at the default TCP/IP stack, and this VMK is not listed there. So you need to specify the stack.

vmkping -I vmk<vmkinterfacenumber> -S vmotion <Destination VMK IP to Ping>

Eg: vmkping -I vmk3 -S vmotion 192.168.1.1

If the ping test fails, check
a) vMotion port group settings
b) ESXi Host Configuration -> Networking -> VMkernel Adapters, and make sure vMotion is enabled only on the correct VMK and not on any other.
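
To keep the three vmkping variants above straight, here is a tiny Python sketch that builds the command line for you. It only assembles the string, using the -I and -S options exactly as shown above; vmkping_command is a made-up helper name and nothing gets executed.

```python
def vmkping_command(destination_ip, interface=None, netstack=None):
    """Build the vmkping command line for the default or a named TCP/IP stack."""
    args = ["vmkping"]
    if interface:
        # e.g. vmk3 when multi-NIC vMotion is in use
        args += ["-I", interface]
    if netstack:
        # e.g. vmotion when the VMK lives on the vMotion TCP/IP stack
        args += ["-S", netstack]
    args.append(destination_ip)
    return " ".join(args)
```

For example, vmkping_command("192.168.1.1", interface="vmk3", netstack="vmotion") gives the same command as the vMotion stack example above.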

3) Check the vMotion network port on the ESXi hosts to see if it is listening
We can use the netcat utility for this test, similar to the telnet test we do on Windows.

nc -z <Destination VMK IP> 8000

In my first try it returned to the prompt immediately, which meant the connection was not established.
In the second try it connected and kept running for a while before it closed the connection.

Another way to confirm this is to check the network port status. The command below is the equivalent of netstat.
While the nc command above is still running, open a new SSH session to the same ESXi host and another to the destination ESXi host, and look at the ports in use

esxcli network ip connection list | grep -i 8000

If the connection test fails, revisit the ESXi firewall rules using the web client
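
For reference, what the nc test does boils down to a plain TCP connect. Below is a minimal Python sketch of the same check; port_is_open is my own helper name, the 3 second timeout is an arbitrary assumption, and it has to run from a machine that can actually reach the vMotion network.

```python
import socket


def port_is_open(host, port=8000, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, then we close it.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

A False result here points to the same places as a failed nc test: firewall rules, port group settings or the VMK configuration.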

Thursday, September 5, 2019

CISCO ACI + vCenter Integration Error "Create a vSphere Distributed Switch" "Status: The operation is not supported on the object"

"Create a vSphere Distributed Switch"

"Status: The operation is not supported on the object"

We had been using Cisco ACI with vSphere for a while. We now have a new datacentre coming up and there were conversations on ACI vs NSX, but we finally decided to give ACI a second chance :(

Current datacenters are on vSphere 6.0 and were linked to ACI with no issues.

We decided to go with the latest vSphere 6.7 and built the new environment. ESXi hosts were added to vCenter, a management DVS was created manually (we always kept the management DVS outside ACI!) and ESXi VMK0 was moved there. Basically all went well and as per plan.

Then came the ACI integration part... We had 3 vCenters and they were all under the same VMM domain, so we decided to add the new VC there as well.

An ACI role was created in vCenter for permissions, and the service account was configured there and mapped to the new role. Time came for the integration and it wasn't happy about something. After adding the new VC, ACI could see the ESXi inventory, but the DVS creation was failing with the above error in vCenter

Operation is not supported! Come on VMware, what operation?!

Anyway, I had to dig through the logs to find out. Finally, below is what I found in the vpxd log

2019-09-06T01:33:50.907Z error vpxd[04740] [Originator@6876 sub=DvsUtils opID=7a4ad61f] Non-VMware DVS [Cisco Systems Inc.: ] is not supported
2019-09-06T01:33:51.003Z warning vpxd[04740] [Originator@6876 sub=dvsKeeper opID=7a4ad61f] DVS name [virtual] not in reserved map of DvsManager instance
2019-09-06T01:33:51.003Z info vpxd[04740] [Originator@6876 sub=vpxLro opID=7a4ad61f] [VpxLRO] -- FINISH task-3903
2019-09-06T01:33:51.003Z info vpxd[04740] [Originator@6876 sub=Default opID=7a4ad61f] [VpxLRO] -- ERROR task-3903 -- group-n41 -- vim.Folder.createDistributedVirtualSwitch: vmodl.fault.NotSupported:
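
If you want to pull this out of a large vpxd log quickly, a regex does the job. A minimal Python sketch, assuming the message text is exactly as in the excerpt above; unsupported_dvs_vendors is a hypothetical helper name.

```python
import re

# Message format taken from the log excerpt above.
NON_VMWARE_DVS_RE = re.compile(r"Non-VMware DVS \[([^\]]*)\] is not supported")


def unsupported_dvs_vendors(lines):
    """Return the vendor string from every 'Non-VMware DVS ... is not supported' line."""
    vendors = []
    for line in lines:
        match = NON_VMWARE_DVS_RE.search(line)
        if match:
            # The log shows 'Cisco Systems Inc.: ', so trim the trailing colon and space.
            vendors.append(match.group(1).rstrip(": "))
    return vendors
```

Any vendor other than VMware showing up here means the DVS creation request is being rejected because of the vendor name.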

So, the issue was.....
The old VMM domains were created a long time back and were set to use Cisco Systems Inc. as the vendor. vSphere 6.0 just doesn't care. (For now! An upgrade to 6.5 will fail, and if you look through the KBs there is a way to modify the vendor with a SQL command. At least we haven't gone that far: we will be migrating all VMs to the new DC once it is built, and the old ones will be decommissioned!) But starting from 6.5 U1 VMware stopped supporting third-party DVS switches. Instead they have opened up APIs so that these vendors can create and consume VDS (vSphere Distributed Switches) through them.

Now, getting back to how we fixed it. Our ACI was well maintained and updated to the latest release, so we could just create a new VMM domain and specify VMware as the vendor there. All worked well!

PowerCLI install error in PowerShell

Today I was trying to install PowerCLI on my Windows 10 machine using the command below

Install-Module -Name VMware.PowerCLI -RequiredVersion 11.1.0.11289667

Ended up with an error "PackageManagement\Install-Package : The following commands are already available on this system:'Get-Cluster,New-Cluster,Remove-Cluster'. This module 'VMware.VimAutomation.Core' may override the existing commands. If you still want to install this module"

It looked like some of the commands conflicted with a module that was already installed.
As per the MS documentation, "If the module being installed has the same name or version, or contains commands in an existing module, warning messages are displayed. After you confirm that you want to install the module and override the warnings, use the -Force and -AllowClobber"

So I added -AllowClobber to the Install-Module command above and it worked!