You may have seen one of the following messages in the log:
ERROR : Database connection disconnected
ERROR : sync out operation connection error
WARNING : database sync operation failed
WARNING : database connection failure
To investigate:
- Check for database error alarms on every Call Bridge server by querying each Call Bridge at https://a.b.c.d/api/v1/system/alarms
- Check for fault conditions on each unit's web interface > general status page
If steps 1 and 2 report no database error alarms or faults, the database has already recovered and is in working condition.
The logged warning or error is usually caused by a brief network failure between database servers. If the messages keep reappearing, we suggest that you look for the underlying cause.
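As a minimal sketch of step 1, the response body from the alarms URL can be filtered for database-related entries. This assumes the body is XML containing `<alarm>` elements that each have a `<type>` child; that shape, the function name `database_alarms`, and the alarm type names in the sample are illustrative assumptions, not taken from this FAQ. Fetching the body (HTTPS with API credentials) is left out.

```python
import xml.etree.ElementTree as ET


def database_alarms(alarms_xml: str) -> list[str]:
    """Return the types of all alarms whose type mentions 'database'.

    Assumption: the response holds <alarm> elements, each with a <type> child.
    """
    root = ET.fromstring(alarms_xml)
    types = []
    for alarm in root.iter("alarm"):
        alarm_type = alarm.findtext("type", default="")
        if "database" in alarm_type.lower():
            types.append(alarm_type)
    return types


# Hypothetical response body, for illustration only.
SAMPLE = """<alarms total="2">
  <alarm id="1"><type>databaseClusterNodeOutOfSync</type></alarm>
  <alarm id="2"><type>cdrConnectionFailure</type></alarm>
</alarms>"""

print(database_alarms(SAMPLE))
```

An empty result from every Call Bridge corresponds to the "no database error alarms" case described above.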
Furthermore, you can:
- Check the database cluster status from the MMP on each database server in the cluster.
- On the MMP, issue the command “database cluster status”. If a database primary is present and the other database servers are in the “connected” and “In Sync” state, the database cluster is in working condition; the database primary may simply have been switched.
- Download the log file from each core server and analyze it. (It can be downloaded via SFTP. The file name is “log”.)
- Search for “sfpool_proxy” in the log files of all servers. You may find the following messages, which show when a database primary switch was attempted:
- sfpool_proxy: Terminated
- sfpool_proxy: Sfpool proxy restarting
- sfpool_proxy: Sfpool proxy up with remote host
- Look at the other sfpool messages immediately before the "Terminated" message to see why the switch was initiated, for example:
- sfpool: Health check a.b.c.12 primary check failure:
- Look at the other sfpool messages immediately after the "Terminated" message to find when the database primary was switched, for example:
- sfpool: Failover monitor: Following primary from node -1 to node x
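The log search described above could be scripted roughly as follows. This is a sketch that only relies on the message fragments quoted in this FAQ and matches them as substrings, since the exact layout of each log line (timestamps, host names) is not specified here. The function name `switch_context` is hypothetical.

```python
def switch_context(log_lines, marker="sfpool_proxy: Terminated"):
    """For each 'Terminated' line, collect the nearest sfpool message before and after it."""
    events = []
    for i, line in enumerate(log_lines):
        if marker not in line:
            continue
        # "sfpool:" does not match "sfpool_proxy:", so only sfpool messages are picked up.
        before = next((l for l in reversed(log_lines[:i]) if "sfpool:" in l), None)
        after = next((l for l in log_lines[i + 1:] if "sfpool:" in l), None)
        events.append({"terminated": line, "before": before, "after": after})
    return events


# Sample lines built from the messages quoted in this FAQ (no timestamps).
SAMPLE_LOG = [
    "sfpool: Health check a.b.c.12 primary check failure:",
    "sfpool_proxy: Terminated",
    "sfpool_proxy: Sfpool proxy restarting",
    "sfpool: Failover monitor: Following primary from node -1 to node x",
    "sfpool_proxy: Sfpool proxy up with remote host",
]

for event in switch_context(SAMPLE_LOG):
    print("why: ", event["before"])
    print("when:", event["after"])
```

To use it on a real file, read the downloaded “log” file into a list of lines and pass that list in place of `SAMPLE_LOG`.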
Database primary switch
All servers in a database cluster perform a database status check at regular intervals. If the check succeeds, nothing changes. If the check fails, or does not complete within a certain time period, the node is considered disconnected. If three successive primary checks fail, one of the replica nodes promotes itself, at which point a new primary is present.
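The "three successive failures" rule can be illustrated with a small counter. This is only a model of the behaviour described above, not the actual implementation: the function name `first_promotion` and the representation of check results as booleans are assumptions made for the example.

```python
def first_promotion(check_results, threshold=3):
    """Return the index of the check at which a replica would promote itself, or None.

    check_results is a sequence of booleans; True means the primary check succeeded.
    """
    consecutive_failures = 0
    for index, ok in enumerate(check_results):
        if ok:
            # Any successful check resets the count of successive failures.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return index
    return None


print(first_promotion([True, False, False, True, False, False, False]))  # 6
print(first_promotion([False, False, True, False]))  # None
```

The first call shows why a brief network failure does not necessarily cause a switch: two failures followed by a success reset the count.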
Should I contact support?
Always go through the above steps 1-4 before contacting support.
The database is not in good condition if you experience any of the following issues:
- You see a database alarm or fault condition when going through steps 1 and 2 above
- There is no primary
- One (or more) of the database servers is not in “connected” or “In Sync” state
If so, please contact Cisco support immediately and send all the information gathered in the 4 steps above. Include only the last 24 hours of log messages, covering the time frame when the fault occurred.
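The cluster-state conditions listed above (no primary, or a server that is not “connected” and “In Sync”) can be expressed as a simple check. This sketch deliberately does not parse the MMP output, whose exact format is not given here. Instead it assumes you have transcribed each node into a small record by hand; the field names `address`, `role`, `connected` and `in_sync` are hypothetical.

```python
def cluster_problems(nodes):
    """Return a list of human-readable problems found in the transcribed cluster state."""
    problems = []
    if not any(node["role"] == "primary" for node in nodes):
        problems.append("no primary")
    for node in nodes:
        if node["role"] == "primary":
            continue
        # Every non-primary server is expected to be both connected and in sync.
        if not (node["connected"] and node["in_sync"]):
            problems.append(f"{node['address']} is not connected and in sync")
    return problems


# Hand-written example state, for illustration only.
NODES = [
    {"address": "a.b.c.11", "role": "primary", "connected": True, "in_sync": True},
    {"address": "a.b.c.12", "role": "replica", "connected": True, "in_sync": False},
    {"address": "a.b.c.13", "role": "replica", "connected": True, "in_sync": True},
]

print(cluster_problems(NODES))
```

An empty list means none of the listed cluster-state conditions apply; alarms and fault conditions from steps 1 and 2 still need to be checked separately.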
- Last update: 08-Sep-2020
- FAQ ID: 1211