Tuesday, November 19, 2019

applmgmt service won't start on PSC Appliance after a converge operation

Scenario

We had a vCenter with an external PSC. We converged them, and the converge job completed successfully except for a certificate warning.
After a week we tried to decommission the old PSC appliance and found that its status was shown in the Web Client as "Unknown".

Upon checking, we found applmgmt in a stopped state. We tried to start it, but it failed with the error below.

[ ~ ]# service-control --status
Running:
 lwsmd pschealth vmafdd vmcad vmdird vmdnsd vmonapi vmware-analytics vmware-certificatemanagement vmware-cis-license vmware-cm vmware-rhttpproxy vmware-sca vmware-sts-idmd vmware-stsd vmware-vapi-endpoint vmware-vmon
Stopped:
 applmgmt vmware-statsmonitor


[ ~ ]# service-control --start applmgmt
Operation not cancellable. Please wait for it to finish...
Performing start operation on service applmgmt...
Error executing start on service applmgmt. Details {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}
Service-control failed. Error: {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}



We had this issue on two infrastructures, and we could fix it on one of them.

FIX that worked on the first PSC

# List all disabled (masked to /dev/null) services for removal.
find /etc/systemd/system/ -lname '/dev/null' -exec ls {} \;   
 
# Automatically remove them (or rm each file) 
find /etc/systemd/system/ -lname '/dev/null' -exec rm {} \;  
 
# Reload the systemd daemon
systemctl daemon-reload  
 
# Start services or Reboot 
service-control --start --all  
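After the services are back, it's worth confirming that applmgmt actually started; service-control also accepts a single service name, so a quick check looks like this:

# Verify applmgmt is now reported as Running
service-control --status applmgmt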


However, the second PSC was still not happy, so we had to remove the replication manually.

Manual Removal of the replication

1) Shut down both the PSC and the vCenter and take offline snapshots
2) Power on only the vCenter. Do not start the PSC
3) SSH to the vCenter and run the commands below

a) List all PSCs connected
]# ./vdcrepadmin -f showservers -h localhost -u administrator -w XXXX
cn=oldpscappliance.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=vcenter.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local


Note -- XXXX is the SSO password for administrator@vsphere.local 

I can now see two entries: the old PSC appliance and the vCenter with the PSC converged into it.
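If you only want this node's direct replication partners rather than every server in the site, the same tool also has a showpartners function; a minimal sketch using the same credentials:

]# ./vdcrepadmin -f showpartners -h localhost -u administrator -w XXXX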
Ran the command below to make sure the vCenter is pointing to the converged PSC and not the old appliance.

]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost
https://vcenter.mydomain.com:443/lookupservice/sdk
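As an additional cross-check, vmafd can report which PSC this node is affinitized to; a sketch using the get-dc-name sub-command of the same vmafd-cli binary (on a converged node this should return the vCenter's own FQDN):

]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-dc-name --server-name localhost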


The lookup service output confirmed that the old PSC appliance is not in use, so we decided to manually remove the association.

# /bin/cmsso-util unregister --node-pnid oldpscappliance.mydomain.com --username administrator --passwd XXXX

Watch the output; it basically ends like this:

2019-11-12T08:29:24.939Z  Running command: ['/usr/lib/vmware-vmafd/bin/dir-cli', 'service', 'list', '--login', 'administrator']
2019-11-12T08:29:25.059Z  Done running command
Stopping all the services ...
All services stopped.
Starting all the services ...
Started all the services.
Success

2019-11-12T08:33:13.071Z  Running command: ['/usr/bin/sed', '-i', '-e', 's/cmsso-util.*/cmsso-util/g', '/var/log/vmware/procstate']
2019-11-12T08:33:13.829Z  Done running command

Log in to the vCenter via the Web Client and, under Administration -> System Configuration, make sure that the old PSC is no longer listed.
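You can also re-run the vdcrepadmin command from step (a); only the converged node should be returned now:

]# ./vdcrepadmin -f showservers -h localhost -u administrator -w XXXX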


You may keep the old PSC appliance for a few days and delete it once it's all good. 


Thursday, November 14, 2019

vMotion Failing at 21% with error "The vMotion failed because the destination host did not receive data from the source host on the vMotion network. Please check your vMotion network settings and physical network configuration and ensure they are correct."


Scenario

We built a new ESXi 6.7 cluster and couldn't get vMotion to work there.

Troubleshooting steps

1) Make sure there is no IP address conflict.

2) SSH to the ESXi host and do a VMK ping check

For Default TCP/IP Stack

vmkping 10.11.7.188

Or, if you are using Multi-NIC vMotion, you might want to specify which VMK interface to use:
vmkping -I vmk<vmkinterfacenumber> <Destination VMK IP to Ping>

E.g.: vmkping -I vmk3 192.168.1.1

For vMotion Stack

If you try the above vmkping command on an ESXi host whose VMK interfaces are on the vMotion stack, it will fail with the error
Unknown interface 'vmk': Invalid argument
because the command looks at the default TCP/IP stack, and this VMK won't be listed there. So you need to specify the stack:

vmkping -I vmk<vmkinterfacenumber> -S vmotion <Destination VMK IP to Ping>

E.g.: vmkping -I vmk3 -S vmotion 192.168.1.1
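If jumbo frames are configured on the vMotion network, it is also worth testing with large, non-fragmentable packets; a sketch assuming a 9000-byte MTU (8972 bytes of ICMP payload) and the same example interface and destination as above:

vmkping -I vmk3 -S vmotion -d -s 8972 192.168.1.1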

If the ping test fails, check
a) the vMotion port group settings
b) ESXi Host Configuration -> Networking -> VMkernel Adapters, and make sure vMotion is enabled only on the correct VMK interface and not on the others (see the CLI check below).
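You can also confirm from the CLI which TCP/IP stack each VMkernel interface belongs to; a sketch using the standard esxcli namespaces (the interface listing shows the netstack for each vmk):

esxcli network ip netstack list
esxcli network ip interface list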

3) Check the vMotion network port status on the ESXi hosts to see if it's listening.
We can use the netcat utility for this test, similar to the telnet test we do in Windows.

nc -z <Destination VMK IP> 8000

In the first try, it dropped straight back to the next line immediately, which means the connection was not established.
If it connects, the command will keep running for a while, as in the second try, before the connection is closed.

Another way to check is to look at the network port status. The command below is the ESXi equivalent of netstat.
While the nc command above is still running and the connection is active, open new SSH sessions to the same ESXi host and to the destination ESXi host and look at the port states:

esxcli network ip connection list | grep -i 8000


If the connection test fails, revisit the ESXi firewall rules using the Web Client.
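The same check can be done from the CLI; a sketch listing the firewall rulesets and filtering for the vMotion one (the exact ruleset name may vary between builds):

esxcli network firewall ruleset list | grep -i vmotion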