We have a database cluster, and I see a database error or warning in the log. What should I do?


You may have seen one of the following messages in the log:

ERROR : Database connection disconnected
ERROR : sync out operation connection error
WARNING : database sync operation failed
WARNING : database connection failure 

To investigate:

  1. Check whether any Call Bridge server reports a database error alarm. Query each Call Bridge at: https://a.b.c.d/api/v1/system/alarms
  2. Check whether any fault conditions are shown on the unit's web interface > general status page.

If no database error alarms or faults are reported in steps 1 and 2, the database has already recovered and is in working condition.
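Step 1 can be scripted once the alarms output has been fetched from each Call Bridge. The following is a minimal sketch, not an official tool: it assumes the response is XML in which each alarm carries a `<type>` element, and the sample document and alarm type names are hypothetical. Compare them with what your own Call Bridge returns before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The element names and the alarm type
# values are assumptions made for illustration only.
SAMPLE = """<alarms total="2">
  <alarm id="1"><type>databaseClusterError</type></alarm>
  <alarm id="2"><type>cdrConnectionFailure</type></alarm>
</alarms>"""


def database_alarm_types(xml_text):
    """Return the text of every <type> element that mentions 'database'."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("type")
        if el.text and "database" in el.text.lower()
    ]


if __name__ == "__main__":
    # An empty list means no database alarm was found in this response.
    print(database_alarm_types(SAMPLE))
```

Run it against the output of every Call Bridge in the cluster, since an alarm may be raised on only one of them.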

The logged warning or error message is usually caused by a short network failure between database servers. If the messages reappear from time to time, we suggest that you look for the cause.

Furthermore, you can:

  3. Check the database cluster status from the MMP on each database server in the cluster.
    • On the MMP, issue the command: “database cluster status”. If a database primary is present and the other database servers are in the “connected” and “In Sync” states, the database cluster is in working condition. The database primary may simply have been switched.
  4. Download the log file from each core server and analyze it. (It can be downloaded via SFTP. The file name is “log”.)
    • Search for “sfpool_proxy” in every server’s log file. You may find the following messages, which show when a switch of the database primary was attempted:
      • sfpool_proxy: Terminated
      • sfpool_proxy: Sfpool proxy restarting
      • sfpool_proxy: Sfpool proxy up with remote host
    • Look at the other sfpool messages immediately before the "Terminated" message to see why the switch was initiated, for example:
      • sfpool: Health check a.b.c.12 primary check failure:
    • Look at the other sfpool messages immediately after the "Terminated" message to find when the database primary was switched, for example:
      • sfpool: Failover monitor: Following primary from node -1 to node x
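The log search described above can be sketched as a small script. This is only an illustration: the function name and the `context` parameter are hypothetical, and the sample lines are the messages quoted above without the timestamps and prefixes that a real log line carries.

```python
# Sample lines taken from the messages quoted in this FAQ.
SAMPLE_LOG = [
    "sfpool: Health check a.b.c.12 primary check failure:",
    "sfpool_proxy: Terminated",
    "sfpool_proxy: Sfpool proxy restarting",
    "sfpool: Failover monitor: Following primary from node -1 to node x",
    "sfpool_proxy: Sfpool proxy up with remote host",
]


def proxy_events(lines, context=3):
    """List (line_number, line, preceding_sfpool_lines) per sfpool_proxy line.

    For a "Terminated" message, the third item holds the other sfpool
    messages found in the `context` lines just before it. For every other
    sfpool_proxy message it is an empty list.
    """
    events = []
    for i, line in enumerate(lines):
        if "sfpool_proxy" not in line:
            continue
        before = []
        if "Terminated" in line:
            start = max(0, i - context)
            before = [
                prev.rstrip("\n")
                for prev in lines[start:i]
                if "sfpool" in prev and "sfpool_proxy" not in prev
            ]
        events.append((i + 1, line.rstrip("\n"), before))
    return events


if __name__ == "__main__":
    for number, text, before in proxy_events(SAMPLE_LOG):
        print(number, text, before)
```

To use it on a downloaded file, read the file into a list of lines first, for example with `open("log").readlines()`, and repeat for each server's log.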

Database primary switch
All servers in a database cluster check the database status at regular intervals. If the check succeeds, nothing changes. If the check fails, or does not complete within a certain time period, the node is considered disconnected. Once a node is considered disconnected, if three successive primary checks fail, one of the replica nodes promotes itself, at which point a new primary is present.
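The promotion rule can be pictured as a counter of consecutive failed primary checks. The sketch below only illustrates the behavior described above and is not the product's implementation. The class and method names are hypothetical, and a check that times out is treated the same as a failed check.

```python
class PrimaryMonitor:
    """Counts consecutive failed primary checks, as seen from one replica."""

    def __init__(self, failures_to_promote=3):
        # Three successive failures trigger promotion, per the text above.
        self.failures_to_promote = failures_to_promote
        self.consecutive_failures = 0

    def record_check(self, succeeded):
        """Record one check; return True when a replica should promote itself."""
        if succeeded:
            # A successful check resets the count, so nothing changes.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_to_promote


if __name__ == "__main__":
    monitor = PrimaryMonitor()
    for ok in [False, False, True, False, False, False]:
        print(ok, monitor.record_check(ok))
```

In this model a brief network failure that causes one or two failed checks does not switch the primary, which is consistent with warnings appearing in the log while the cluster remains healthy.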

Should I contact support?
Always work through all four steps above before contacting support.

The database is not in good condition if any of the following applies:

  • You see a database alarm or fault condition when going through steps 1 and 2 above
  • There is no primary
  • One or more of the database servers is not in the “connected” or “In Sync” state

If so, please contact Cisco support immediately. Send all the information gathered in the four steps above. Include only the last 24 hours of log messages, covering the time frame when the fault occurred.
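Trimming a log to its last 24 hours can be sketched as follows. The timestamp layout at the start of each line is an assumption here (`YYYY-MM-DD HH:MM:SS`), and the function name and parameters are hypothetical. Check a few lines of your own log file and adjust `fmt` to match before using it.

```python
from datetime import datetime, timedelta


def last_24_hours(lines, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Keep lines whose leading timestamp lies in the 24 hours up to `end`.

    `fmt` is an assumed, fixed-width timestamp layout. Lines that do not
    start with a timestamp in that layout are skipped.
    """
    width = len(end.strftime(fmt))  # number of characters the timestamp uses
    start = end - timedelta(hours=24)
    kept = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:width], fmt)
        except ValueError:
            continue  # no parseable timestamp at the start of this line
        if start <= stamp <= end:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "2020-09-08 10:00:00 sfpool_proxy: Terminated",
        "2020-09-06 10:00:00 older message",
    ]
    print(last_24_hours(sample, datetime(2020, 9, 8, 12, 0, 0)))
```

Choose `end` so that the 24-hour window covers the time at which the fault occurred.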

 

Last update: 08-Sep-2020
FAQ ID: 1211