| Author: Jayakumar Srinivasan | Date: 17-Jan-2025 |
Introduction
In this article I am going to discuss about a auto re-try solution that I arrived on running a windows task scheduler in Azure VM (Well, this will work on a console application deployed on to Azure VM a s well. This has reduced the troubleshooting and support task to a great extent to my team. Well some might argue that same can be done with Azure Function or Web Jobs but it was decided based on some limitations we encountered using these and that is out of scope of this article.
Problem Description
We have a system developed as console application and deployed as Windows Task in a Azure VM. This system was scheduled to run early morning before business hours to sync data between 2 systems. There were numerous integrations that needs to be synchronized and there are separate task schedules for each of them that were configured to runs on different time but all completed before start of business hours. We identified that due to some technical limitation to one of partner system involved in synchronization was causing random failures on one or many integrations on daily basis. The application support team needs to monitor the failed integration and re-run the task scheduler on Azure VM numerous times till the integration succeeds.
Existing System Limitations
Unreliable Downstream Service
The partner service we are connecting is experiencing performance issue recently and was failing to return business data for processing. This data from partner service is required to be pushed to another partner system’s BI database for live reporting.
Lack of Built-in Retry Logic
The existing system is designed in such a way that it will pull the downstream service data once per day in the non- business hours. As this runs on a VM to avoid costing the machine is switched off when the data pull from downstream service ends. The same can be configured for retry-logic but the VM needs to run for whole day which was not a cost-optimized solution.
Proposed Solution
Azure Automation Runbook Overview
Azure Automation Runbook is a cloud-based automation service that allows you to create, monitor, and manage automated workflows. It supports PowerShell and Python scripts, making it versatile for various automation tasks. For our scenario we have used a PowerShell script automation that will be scheduled to run on a specific interval on all business days.
Implementing Auto-Retry Logic
To implement an auto-retry mechanism, you can need to do the following that involves:
- Identifying the failed Integrations: We have used the Azure SQL to register failed integration on the early morning run. We created a stored procedure that will return the failed integrations for current date
- PowerShell Runbook Script: Develop a script that includes the retry logic. This script should:
- Zero hardcoding of environment details in script, use Automation variables instead as best practice
- Query Azure SQL Db and identify the failed integrations
- Validate if the Azure VM where the Windows Task is configured is in started state, if not start the VM
- Run a command line script on the VM to start the Windows Scheduled Task
- Sleep for specific time so that the failed integration gets executed
- Stop the Azure VM so that it will not incur idle cost
Psuedo-code Runbook Script
PowerShell Runbook script with auto-retry logic:
$subscriptionId = Get-AutomationVariable -Name "subscriptionId" # Variable used to get the Azure Subscription ID
$sqlSvrName = Get-AutomationVariable -Name "Retry_sqlSvrName" # Variable used to get the SQL Server Name
$sqlDbName = Get-AutomationVariable -Name "Retry_sqlDbName" # Variable used to get the SQL Database Name
$sqlUserName = Get-AutomationVariable -Name "Retry_sqlUserName" # Variable used to get the SQL Server User Name
$sqlPassword = Get-AutomationVariable -Name "Retry_sqlPassword" # Variable used to get the SQL Server Password
# Variable used to get the name of stored procedure for failed count
$sqlSpGetFailedIntegrationsCount = Get-AutomationVariable -Name "Retry_sqlSpGetFailedIntegrationsCount"
# Variable used to get the name of stored procedure for failed integartions rows
$sqlSpGetFailedIntegrations = Get-AutomationVariable -Name "Retry_sqlSpGetFailedIntegrations"
# Variable used to get the name of stored procedure to update the failed integrations flags
$sqlSpUpdateFailedIntegrations = Get-AutomationVariable -Name "Retry_sqlSpUpdateFailedIntegrations"
# Variable used to get the name of virtual machine resource group
$vmresourceGroupName = Get-AutomationVariable - Name "Retry_vmresourceGroupName"
$vmName = Get-AutomationVariable - Name "Retry_vmName" # Variable used to get the name of virtual machine
$logicAppUrl = Get-AutomationVariable -Name "Retry_logicAppUrl" # Variable used to get the Logic App url
$logicAppName = Get-AutomationVariable -Name "Retry_logicAppName" # Variable used to get the Logic App Name
$sleepDuration = Get-AutomationVariable -Name "Retry_Sleep_General" # Variable used to get the Sleep in seconds after
$vmStarted = $false
$failedIntegrationsRetrieved = $false
$failedIntegrations = @{}
$executedIntegrations = @{}
$counter = 0
# Logging the start time of the script
$startDateTime = Get-Date
$cnnBuilder = New-Object System.Data.SqlClient.SqlConnectionStringBuilder
$cnnBuilder.psBase.DataSource = $sqlSvrName
$cnnBuilder.psBase.InitialCatalog = $sqlDbName
$cnnBuilder.psBase.UserID = $sqlUserName
$cnnBuilder.psBase.Password = $sqlPassword
$cnnBuilder.psBase.ConnectRetryCount = 3
$cnnBuilder.psBase.ConnectRetryInterval = 10
$cnnBuilder.psBase.ConnectTimeout = 600
$cnnBuilder.psBase.IntegratedSecurity = $false
$cnnBuilder.psBase.MultipleActiveResultSets = $true
function Open-SqlConnection {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)]
[System.Data.SqlClient.SqlConnection]$SqlConnection
)
# Variable to store the result
$result = 1
try
{
if ($SqlConnection.State -eq [System.Data.ConnectionState]'Closed')
{
$SqlConnection.Open()
}
}
catch {
Write-Output "An error occurred while executing the stored procedure: $_"
$result = -1001
}
# Return the result as an integer
return [int]$result
}
function Close-SqlConnection {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)]
[System.Data.SqlClient.SqlConnection]$SqlConnection
)
# Variable to store the result
$result = 1
try
{
if ($SqlConnectionn.State -ne [System.Data.ConnectionState]'Closed')
{
# Close the SQL connection
$SqlConnection.Close()
}
}catch{
$result = -1001
Write-Output "Failed to when closing the SQL connection that was opened earlier."
Write-Output "Error: $_"
}
# Return the result as an integer
return [int]$result
}
function Execute-Scalar-SP {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)]
[System.Data.SqlClient.SqlConnection]$SqlConnection,
[Parameter(Mandatory=$true)]
[string]$StoredProcedure
)
# Variable to store the result
$result = 0
try
{
$result = Open-SqlConnection -SqlConnection $SqlConnection | Out-Null
if ($result -ne -1001)
{
Write-Output "Open-Connection Returned - $result" | Out-Null
# Create the SQL command
$command = $SqlConnection.CreateCommand()
$command.CommandText = $StoredProcedure
$command.CommandType = [System.Data.CommandType]::StoredProcedure
# Execute the command and get the result
$result = $command.ExecuteScalar()
}
}
catch {
Write-Output "An error occurred while executing the stored procedure: $_"
$result = -1001
}
# Return the result as an integer
return $result
}
function Execute-Table-SP {
[CmdletBinding()]
param (
[Parameter(Mandatory=$true)]
[System.Data.SqlClient.SqlConnection]$SqlConnection,
[Parameter(Mandatory=$true)]
[string]$StoredProcedure
)
# Initialize $tableResults as an empty array
$tableResults = @()
try {
# Write-Output "About to check SQL Connection Status, if its not open then will open it"
$result = Open-SqlConnection -SqlConnection $SqlConnection | Out-Null
if ($result -ne -1001)
{
# Write-Output "Open-Connection Returned - $result"
# Create the SQL command
$command = $SqlConnection.CreateCommand()
$command.CommandText = $StoredProcedure
$command.CommandType = [System.Data.CommandType]::StoredProcedure
# Execute the command and read the results
$reader = $command.ExecuteReader()
while ($reader.Read()) {
$row = @{}
for ($i = 0; $i -lt $reader.FieldCount; $i++) {
$row[$reader.GetName($i)] = $reader.GetValue($i)
}
$tableResults += [PSCustomObject]$row
}
$reader.Close()
}
}
catch {
Write-Output "An error occurred while executing the stored procedure: $_"
throw
}
# Return the result
return $tableResults
}
<#
Step 1
1. Initializing the SQL Connection and Local varaibles with default values
2. Log the starting time of the Runbook
#>
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection
$dictionary = New-Object System.Collections.Generic.Dictionary"[String,String]"
$sqlConnection.ConnectionString = $cnnBuilder.psBase.ConnectionString
Write-Output "Starting Time $startDateTime"
<#
Step 2: Calling the Stored Procedure to see if there are any failed integrations today
#>
$result = 0
Write-Output "About to execute the storedProcedure: $sqlSpGetFailedIntegrationsCount"
$result = Execute-Scalar-SP -SqlConnection $sqlConnection -StoredProcedure $sqlSpGetFailedIntegrationsCount
Write-Output "The result from the stored procedure is: $result"
<#
Step 3
1. Validate if Step 2 is Passed
2. Calling the Stored Procedure to get details of all the failed integration names for today
#>
if (($result -ne -1001) -and ($result -gt 0))
{
try
{
$failedIntegrations = Execute-Table-SP -SqlConnection $sqlConnection -StoredProcedure $sqlSpGetFailedIntegrations
$failedIntegrationsRetrieved = $true
<#
Step 4
1. Validate if Step 3 is Passed and the failed integartion is true flag.
2. Setting the Azure Contexts
3. Start the Azure VM where Schedule tasks are hosted
#>
if ($failedIntegrationsRetrieved -eq $true )
{
try
{
$vmStarted = $false
Write-Output "Seems few of the integrations failed."
Write-Output "This will now start the VM and will run the failed integrations."
Write-Output "Setting the Azure contexts start ..."
Write-Output "Calling the Disable-AzContextAutoSave..."
# Ensures you do not inherit an AzContext in your runbook
Disable-AzContextAutosave -Scope Process
Write-Output "Calling the Connect-AzAccount with managed identity..."
# Connect to Azure with system-assigned managed identity
Connect-AzAccount -Identity
# Connect-AzAccount -Identity -AccountId "ac31ba68-be92-4173-9e94-8ca9b1e0cee7"
Write-Output "Calling the Set-AzContext to set the current execution context"
# set and store context
$AzureContext = Set-AzContext –SubscriptionId "$subscriptionId"
Write-Output "Setting the Azure contexts end ..."
# Check the status of the VM before starting it.
# Check if Azure VM status, if the status is not running then start the VM
# Get the VM status
Write-Output "Calling Get-AzVM to get the status of the VM"
$vmStatus = Get-AzVM -ResourceGroupName $vmresourceGroupName -Name $vmName -Status
# Extract the power state
$powerState = $vmStatus.Statuses | Where-Object { $_.Code -like "PowerState/*" }
# Display the VM status
Write-Output "The current status of the VM '$vmName' in resource group '$vmresourceGroupName' is: $($powerState.DisplayStatus)"
# Step X : Start the VM as its stopped.
if ($powerState.DisplayStatus -ne "VM running" )
{
Write-Output "The Azure VM: $vmName is not in running state, starting the VM using Start-AzVM"
Write-Output "Executing the Start-AzVM to start the VM"
# Start the VM
Start-AzVM -Name $vmName -ResourceGroupName $vmresourceGroupName -DefaultProfile $AzureContext
Write-Output "The Azure VM: $vmName started successfully"
$vmStarted = $true
# # Wait for VM to start
# Start-Sleep -Seconds 30
Write-Output "Azure VM started successfully...."
}
else
{
Write-Output "VM is already running..."
$vmStarted = $true
}
}catch {
Write-Output "Error occurred when starting the VM: $vmName"
Write-Output $_
}
}
<#
Step 5
1. Validate if Step 4 is Passed and the Virtual Started flag is true.
2. Execute the Stored Procedure in DB to update the flag of the failed integrations
#>
if (($failedIntegrationsRetrieved -eq $true) -and ($vmStarted -eq $true))
{
$spUpdate = $false
try
{
Write-Output "Updating the status in the BatchReconcile table before rerunning the integrations in VM."
Write-Output "About to execute the storedProcedure: $sqlSpUpdateFailedIntegrations"
$failedIntegResult = Execute-Scalar-SP -SqlConnection $sqlConnection -StoredProcedure $sqlSpUpdateFailedIntegrations
Write-Output "Stored procedure executed successfully..."
$spUpdate = $true
}
catch
{
Write-Output "Error occurred when executing the SP: $sqlSpUpdateFailedIntegrations"
Write-Output $_
}
}
<#
Step 6
1. Check if the Stored Procedure Update flag is true from previous step 5.
2. Execute the Powershell script remotely to the Azure VM and run the failed Integrations one by one
3. As the Task Schedulers run Asynchronously provide buffer time complete the asychronous task. The
sleep time is configured in the sleepDuration variable, check the values in the Automation Services
properties for the value.
4. Store the success or failure of the individual integrations in a HashTable
#>
$hasBatchExecuted = $false
if ($spUpdate = $true)
{
try
{
$commandId = "RunPowerShellScript"
$taskPath = "\Schedules-DEV\Delta"
Write-Output "failedIntegrations: $failedIntegrations"
Write-Output $failedIntegrations.GetType().Name
foreach ($row in $failedIntegrations)
{
$row.PSObject.Properties | ForEach-Object {
try
{
if ($_.Name -eq "MethodToProcess")
{
$moduleName = "Sync_$($_.Value)_Delta"
$scriptStringVal = "Start-ScheduledTask -TaskName $moduleName -TaskPath $taskPath"
Write-Output "About to execute module $moduleName with Script $scriptStringVal"
# Execute the command on the VM
Invoke-AzVMRunCommand -ResourceGroupName $vmresourceGroupName -Name $vmName -CommandId $commandId -ScriptPath $null -ScriptString $scriptStringVal
Write-Output "Module $moduleName ran successfully..."
Write-Output "About to sleep for $sleepDuration"
# Wait for VM to start
Start-Sleep -Seconds $sleepDuration
$hasBatchExecuted = $true
$executedIntegrations[$_.value] = "Passed"
}
}
catch
{
$executedIntegrations[$_.value] = "Failed"
continue
}
}
}
}
catch
{
Write-Output "An error occurred:"
Write-Output $_
}
}
<#
Step 7
1. Check if the Task Scheduler run has atleast triggerred one integration by checking the hasBatchExecuted flag to true from previous step.
2. Call the Logic App configured to trigger the AxBus Datafactory Pipelines to sync AxBus DB Tables to BI DB Tables
3. This call to Logic App will be execute for the Integrations that has successfully exected in previous step.
All Failed integrations in previous step will be skipped.
#>
if ($hasBatchExecuted -eq $true)
{
Write-Output "Batchs has been executed, will call the Logic App to run the DataFactory Pipeline for passed integrations."
foreach($key in $executedIntegrations.keys)
{
if ($executedIntegrations[$key] -eq "Passed")
{
Write-Output "About to execute the Logic App: $logicAppName with Process Method: $key"
# Define the payload to send with the POST request
# Here, we're sending the Sync JSON payload. Adjust as needed.
$payload = @{
"ProcessMethod" = "$key"
"SyncType" = "Delta"
} | ConvertTo-Json
# Define headers if required
$headers = @{
"Content-Type" = "application/json"
}
# Send the HTTP POST request to the Logic App URL
try {
$response = Invoke-RestMethod -Uri $logicAppUrl -Method Post -Body $payload -Headers $headers
Write-Output "Logic App invoked successfully. Response: $($response | ConvertTo-Json)"
} catch {
Write-Error "Failed to invoke Logic App. Error: $_"
}
}
}
}
<#
Step 8
1. Gracefully close the SQL Connection opened for releasing memeory.
2. Check the Virtual Machine started flag to true, if so then stop the Azure VM gracefully.
3. Log the ending time of the Runbook
#>
$closeConnectionResult = Close-SqlConnection -SqlConnection $sqlConnection
if ($vmStarted)
{
try
{
Write-Output "The Azure VM: $vmName will be stopped using Stop-AzVM"
# Stop the VM
Stop-AzVM -ResourceGroupName $vmresourceGroupName -Name $vmName -Force -ErrorAction Stop
Write-Output "The Azure VM: $vmName stopped successfully"
$vmStarted = $true
# # Wait for VM to start
# Start-Sleep -Seconds 30
} catch{
Write-Error "Failed to when stop the VM, the error as below."
Write-Error "Error: $_"
}
}
}catch {
$failedIntegrationsRetrieved = $false
Write-Output "Error occurred when executing the SP: $sqlSpGetFailedIntegrations"
Write-Output $_
}
}
# Write-Output "All Tasks completed..."
# Logging the end time of the script
Write-Output "Completed Time $(Get-Date)"
Benefits of the Proposed Solution
Increased Reliability
By implementing an auto-retry mechanism, the reliability of automated workflows is significantly improved. Transient errors are handled gracefully, reducing the likelihood of complete workflow failures.
Reduced Manual Intervention
The auto-retry logic minimizes the need for manual intervention, allowing automated processes to recover from transient failures autonomously. This leads to increased efficiency and reduced operational overhead.
Enhanced Efficiency
Automated retry mechanisms ensure that tasks are completed successfully without unnecessary delays. This enhances the overall efficiency of the system, as tasks do not remain in a failed state for extended periods. This ensured zero manual intervention on any failure in downstream service.
Cost Savings
By reducing downtime and the need for manual intervention, organizations can achieve cost savings. Automated processes can run more smoothly, leading to better resource utilization and lower operational costs. We saw a 60% reduction in VM costing after this implementation.
Conclusion
With this approach we have seen a 60% reduction in VM costing, 0% manual intervention on any failures, reports generated on auto-retry enhanced the reliability of system among the various stack holders and business owners.
References
Tutorial: Create a PowerShell Workflow runbook in Automation
Tutorial: Create Automation PowerShell runbook using managed identity