Crash Detection And Recovery – Fire And Ice Grid

Crash Detection and Recovery is an update to the Fire And Ice Opensim Grid and to the Automated Opensim Startup and Shutdown guide.

The most common way to have services restart automatically is systemd. Unfortunately, this didn’t work satisfactorily for mono processes running inside a Tmux shell.

My solution is to use a keep-alive bash script for Robust and for each simulator group. A cron job then runs these scripts at regular intervals. Each script performs a three-stage check. First, it looks for the PID file to see if it is present. Second, if the file is present, it uses the process ID to check whether the process is running. Finally, if the process is running, it uses curl to send an HTTP request to test for responsiveness. If any check fails, the process is stopped and then started again.

Crash Detection And Recovery – Keep Alive Script

#!/bin/bash

SimulatorGroup="Main"
Simulators=( Simulator00 Simulator02 Simulator03  )
Port=9000

check_pid_file_exists()
{
    #$1=file $2=Simulator $3=Port
    echo "checking for $1"
    if [ -f "$1" ]; then
        echo "$1 exists "
        read_pid_file $1 $2 $3
    else 
        REASON="because_no_pid_file"
        #echo "Debug: Restart goes here due to no pid file"
        restart_simulator $2 $REASON
    fi
}

read_pid_file()
{
    #$1=file $2=Simulator $3=Port
    PID=$(<$1)
    echo "process number = $PID"
    check_if_process_running $PID $2 $3
}

check_if_process_running()
{
    #$1=pidNumber  $2=Simulator $3=Port
    echo "checking to see if $1 is running"
    PROCESS=$1  
    REASON="because_process_not_running"
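    # -x matches the whole line, so PID 123 does not match 1234.
    # if/else avoids a second restart when the frozen check itself returns non-zero.
    if pgrep mono | grep -qx "$1"; then
        check_if_process_frozen $2 $3
    else
        restart_simulator $2 $REASON
    fi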
}

check_if_process_frozen()
{
    # $1=Simulator $2=Port
    echo "checking to see if $1 on port $2 is frozen"
    REASON="because_process_frozen"
    Curl="/usr/bin/curl -s"
    Address="localhost:$2/simstatus/"
    # timeout stops a hung simulator from blocking the script
    Status=$(timeout 10s $Curl $Address)
    # quoted so that an empty reply does not break the test
    if [ "$Status" = "OK" ]; then
        echo "Simulator: $1 is: $Status"
    else
        echo "Simulator: $1 is: Frozen, begin restart"
        restart_simulator $1 $REASON
    fi
}

restart_simulator()
{
    # $1=Simulator $2=Reason
    echo "restarting $1 $2"
    Pre="Simulators"
    Stop="_Stop.sh"
    Start="_Start.sh"
    Switch="fast"
    CommandStop="$Pre$SimulatorGroup$Stop $Switch $1"
    CommandStart="$Pre$SimulatorGroup$Start $1" 
    ./$CommandStop
    ./$CommandStart
}

for Simulator in "${Simulators[@]}"
do
    FILEPATH="/tmp/"
    FILENAME="$Simulator.pid"
    FILE=$FILEPATH$FILENAME
    check_pid_file_exists $FILE $Simulator $Port
    Port=$(($Port+10))
done

Keep Alive Script Details – Crash Detection And Recovery

This script follows the same principle as the Automated Opensim Startup and Shutdown scripts. It also uses those files to perform the stop and start routines. It will not work as a standalone script.

Keep Alive Script Variables

‘SimulatorGroup’ forms part of the names of the scripts this one calls. With the group set to ‘Main’, those are ‘SimulatorsMain_Stop.sh’ and ‘SimulatorsMain_Start.sh’. The ‘Simulators’ array contains the names of all the simulators in the group. As in the startup and shutdown examples, these match the names of the Tmux windows they run inside. ‘Port’ is the lowest port number used by any of the simulators.
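
As a small sketch of how the group name turns into file names, the snippet below rebuilds the two commands in the same way ‘restart_simulator’ does. It only prints them and executes nothing. ‘build_commands’ is a hypothetical helper written for this illustration and is not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: print the stop and start commands for one simulator
# in a given group, using the same naming scheme as restart_simulator.
build_commands()
{
    # $1=SimulatorGroup $2=Simulator
    echo "./Simulators${1}_Stop.sh fast $2"
    echo "./Simulators${1}_Start.sh $2"
}

build_commands "Main" "Simulator00"
```

Changing the first argument to another group name shows which pair of scripts a keep-alive script for that group would need to find next to it.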

Script Assumptions

Simulator Port Numbers

In this example, each simulator’s port number is exactly ten higher than the previous one. Simulator00 is on port 9000, Simulator01 is on 9010, and so on in increments of 10.
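
The port arithmetic can be sketched as below. The simulator names are examples only, and ‘BasePort’ and ‘port_for_index’ are names made up for this illustration.

```shell
#!/bin/bash
# Print the port each simulator is expected to listen on, assuming the
# first simulator uses BasePort and each later one is ten higher.
Simulators=( Simulator00 Simulator01 Simulator02 )   # example names only
BasePort=9000

port_for_index()
{
    # $1=position in the array (0-based)
    echo $(( BasePort + $1 * 10 ))
}

for i in "${!Simulators[@]}"
do
    echo "${Simulators[$i]} -> $(port_for_index "$i")"
done
```

Note that the port depends on the position in the array, not on the number in the simulator’s name. The keep-alive script works the same way, so the order of the array has to match the order of the ports.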

PID file name and location

The script requires each simulator to save its PID file in ‘/tmp’ using its own name as the file name. E.g. ‘Simulator00’ runs inside a Tmux window called ‘Simulator00’, which resides inside a Tmux session called ‘SimulatorsMain’. The PID file for ‘Simulator00’ is saved in ‘/tmp’ with the file name ‘Simulator00.pid’.
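
To check a PID file by hand before relying on the cron job, something like the sketch below may help. ‘pid_file_status’ is a hypothetical helper, and it uses ‘kill -0’ rather than the pgrep test in the keep-alive script, so it only confirms that a process with that ID exists, not that it is a mono process.

```shell
#!/bin/bash
# Hypothetical helper: report whether the process recorded in a PID file
# is alive. kill -0 sends no signal; it only tests that the PID exists
# and that this user is allowed to signal it.
pid_file_status()
{
    # $1=path to PID file
    local pid
    if [ ! -f "$1" ]; then
        echo "missing"
        return 1
    fi
    pid=$(<"$1")
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "not_running"
        return 1
    fi
}

pid_file_status "/tmp/Simulator00.pid" || true
```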

Setting Simulator PID file details and Port Numbers

The ‘Opensim.ini’ file for each simulator sets the port number and PID file name. Examples and more information about Opensim.ini are available in Opensim with multiple Robust services on Ubuntu.
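
A quick way to confirm that a simulator’s settings match what the keep-alive script expects is to read the two values back out of its ini file. The sketch below does this against a small sample fragment. The section and key names in the sample (‘PIDFile’ and ‘http_listener_port’) are assumptions about a typical Opensim.ini, and ‘ini_value’ is a hypothetical helper, so check both against your own files.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one "key = value" setting from
# an ini file, with surrounding quotes removed. Lines commented out with
# a leading ; are skipped because the pattern anchors on the key name.
ini_value()
{
    # $1=ini file $2=key
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tr -d '"' | head -n 1
}

# Example fragment; the section and key names here are assumptions.
Sample=$(mktemp)
cat > "$Sample" <<'EOF'
[Startup]
    PIDFile = "/tmp/Simulator00.pid"
[Network]
    http_listener_port = 9000
EOF

echo "PID file: $(ini_value "$Sample" PIDFile)"
echo "Port:     $(ini_value "$Sample" http_listener_port)"
rm -f "$Sample"
```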

Adding the Keep Alive Script to a cron job

In a terminal, type the following to open the cron jobs for your user (assuming Opensim runs as your user; otherwise adjust root’s cron jobs).

crontab -e

Using a text editor, add the line below, adjusting as required. Many Opensimulator processes can take a substantial amount of time to close down cleanly. For this reason, it is not a good idea to run the checks more often than every 15 minutes. The example below runs them every 30 minutes.

*/30   *      *       *       * /home/sara/SimulatorsMain_KeepAlive.sh

Save the file and exit.
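
Because a slow shutdown could still be in progress when cron starts the next run, one possible safeguard is a simple lock at the top of the keep-alive script. This is a sketch and not part of the original script. The lock path and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch of a lock so that a new run exits if an earlier run is still
# busy. mkdir is atomic: it either creates the directory or fails.
LockDir="/tmp/SimulatorsMain_KeepAlive.lock"   # hypothetical path

acquire_lock()
{
    mkdir "$LockDir" 2>/dev/null
}

release_lock()
{
    rmdir "$LockDir" 2>/dev/null
}

if acquire_lock; then
    # Remove the lock again when the script exits
    trap release_lock EXIT
    echo "lock acquired, checks would run here"
else
    echo "previous run still active, skipping"
fi
```

One caveat with this approach: if the script is killed in a way that skips the exit trap, the lock directory stays behind and has to be removed by hand before the checks will run again.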
