Tuesday, November 19, 2019

applmgmt service wont start on PSC Appliace post converge operation

Scenario

We had a vCenter with External PSC. We converged them and converge job was successful execpt a cert warning.
After a week we tried to decommission the old PSC appliance and found that the status is shown in WebClient as "Unknown"

Up on checking we found applmgmt in stopped state. Tried to start it but it failed with below error

[ ~ ]# service-control --status
Running:
 lwsmd pschealth vmafdd vmcad vmdird vmdnsd vmonapi vmware-analytics vmware-certificatemanagement vmware-cis-license vmware-cm vmware-rhttpproxy vmware-sca vmware-sts-idmd vmware-stsd vmware-vapi-endpoint vmware-vmon
Stopped:
 applmgmt vmware-statsmonitor


[ ~ ]# service-control --start applmgmt
Operation not cancellable. Please wait for it to finish...
Performing start operation on service applmgmt...
Error executing start on service applmgmt. Details {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}
Service-control failed. Error: {
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "id": "install.ciscommon.service.failstart",
            "args": [
                "applmgmt"
            ],
            "localized": "An error occurred while starting service 'applmgmt'"
        }
    ],
    "componentKey": null,
    "resolution": null,
    "problemId": null
}



We has this issue on two infrastructures and we could fix it one

FIX that worked on first PSC

# List all disabled services for removal.  
find /etc/systemd/system/ -lname '/dev/null' -exec ls {} \;   
 
# Automatically remove them (or rm each file) 
find /etc/systemd/system/ -lname '/dev/null' -exec rm {} \;  
 
# Relaod systemctl daemon 
systemctl daemon-reload  
 
# Start services or Reboot 
service-control --start --all  


However second PSC was not happy still. So we had to manfully remove the replication manually

Manual Removal of the replication

1) Shutdown both PSC and vCenters and take an offline snap
2) Power on only vCenter. Do not start PSC
3) SSH to vCenter and run below commands

a) List all PSCs connected
]# ./vdcrepadmin -f showservers -h localhost -u administrator -w XXXX
cn=oldpscappliance.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=vcenter.mydomain.com,cn=Servers,cn=Sites,cn=Configuration,dc=vsphere,dc=local


Note -- XXXX is the SSO password for administrator@vsphere.local 

I can now see two, old PSC appliance and also the vCenter with PSC converged in to it.
Ran below command to make sure vCenter is pointing to converged PSC and not the old appliance

]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost
https://vcenter.mydomain.com:443/lookupservice/sdk


Output confirmed that the PSC appliance is not in use. So decided to manually remove the association.

# /bin/cmsso-util unregister --node-pnid oldpscappliance.mydomain.com --username administrator --passwd XXXX

Watch theoutput basically ends like this

2019-11-12T08:29:24.939Z  Running command: ['/usr/lib/vmware-vmafd/bin/dir-cli', 'service', 'list', '--login', 'administrator']
2019-11-12T08:29:25.059Z  Done running command
Stopping all the services ...
All services stopped.
Starting all the services ...
Started all the services.
Success

2019-11-12T08:33:13.071Z  Running command: ['/usr/bin/sed', '-i', '-e', 's/cmsso-util.*/cmsso-util/g', '/var/log/vmware/procstate']
2019-11-12T08:33:13.829Z  Done running command

Login to the vCenter via WebClient and under Administration ->  System Configuration makesure that the old PSC is listed anymore.


You may keep the old PSC appliance for a few days and delete it once it's all good. 


No comments: