Manually kickstarting a deployment

There are occasions that a Deployment can get stuck in a "PENDING" state.  There are a few possible reasons that this can occur.  Below is a list of reasons and steps that can kick start a deployment.

Reason #1 - Serial Lock

Background: System Events have the capability to be serially executed.  Deployment events are intentionally designed to use this serial execution capability.  That is to guarantee the consistency of the data being deployed.  The same order that you approve/deploy is the order that the data will be applied to your site/catalog.  To manage that serial process, a global lock is used.  The lock is managed in the blc_serial_event_lock table.  

Potential Issue: If a serial execution event is triggered on a node and that node is restarted, the lock can be stranded.  If the "SCHEDULED_DEPLOY" row in the blc_serial_event_lock table is marked as locked (e.g. lock_val = 1) and it is not getting cleared then the lock is stranded.  

select * from blc_serial_event_lock where key_val = "SCHEDULED_DEPLOY";

The Fix: If the serial lock is stranded, we just need to unlock it.  To unlock, the lock_val can be set to 0.

update blc_serial_event_lock set lock_val=0 where key_val = "SCHEDULED_DEPLOY";

Reason #2 - Deployment Master

Background: The system elects two "master" nodes that manage standard Scheduled Job scheduling (Job Master) and Deployment Job scheduling (Deployment Master).  These nodes are responsible for making sure that the Scheduled Jobs are picked up and processed.  The job master is tracked in the blc_schedl_job_master table.  Two rows should exist: Job Master (ID=1) and Deployment Master (ID=2).  The Job Master can be any type of node (site or admin) while the Deployment Master will only be an admin node.  Note that the IDs are always 1 and 2.  An associated table that is involved in the election process is the blc_registered_node table (or blc_node_registration for versions < 5.2).  The blc_registered_node table will list all the active nodes in the cluster.

Potential Issue: The nodes go through a "master election" process to designate one node for each of the master roles.  There are scenarios where old nodes remain elected in a master role when they should not.  This is rare but when this happens it is usually associated with an issue launching a new website release.

The Fix: Since there is a "master election" process, the system will self heal if a master node goes offline.  We can take advantage of that process by deleting the Deployment Master node.  The first step is to verify that the nodes in the blc_registered_node (blc_node_registration) table are accurate.  Second, the Deployment Master from blc_sched_job_master should match up to an active admin node.  What can happen is that there are still "old" nodes in the blc_registered_node table and the Deployment Master row in blc_sched_job_master is that old admin.

select * from blc_registered_node; -- for BLC 5.2+
select * from blc_node_registration; -- for BLC < 5.2
select * from blc_sched_job_master;
delete from blc_sched_job_master where sched_job_master_id=2; -- 2 is the deployment master

Reason #3 - Process 

Background: The system uses a process system to track the progression through the deployment steps.  The process row then provides the ability to re-start deployments should a system restart occur or other unexpected events.  

Potential Issue: On occasion a process restart might not get triggered after an unexpected event.  In that case we can force the process to get re-started.

The Fix: The framework has a table called BLC_PROCESS_STATUS that is used to keep track of the workflow of a change set.  This table works in conjunction with the BLC_SNDBX_WRKFLW_EVENT.SETUP_STATUS column. The ProcessStatus process checks BLC_SNDBX_WRKFLW_EVENT.SETUP_STATUS (which gets incremented from 100, 200, etc.) to make sure a change set is followed through to completion. However, there are times where this process will end due to an unexpected error on the server. This will leave the entity "stuck" but we can have the framework retry by inserting a new record in the BLC_PROCESS_STATUS table.

To manually kickstart a deployment you can perform the following steps:

1) Find the stuck deployment rows.  Ultimately we need the WORKFLOW_DEPLOYMENT_ID column but getting the full picture of that row can help:

select
  dep.WORKFLOW_DEPLOYMENT_ID,
  dep.STATUS,
  dep.DEPLOY_DATE,
  ev.ACTION_TYPE,
  ev.SETUP_STATUS,
  wrkitem.WRKFLW_SNDBX_ITEM_ID,
  wrkitem.GRP_DESCRIPTION,
  wrkitem.SCHEDULED_DATE,
  wrkitem.ARCHIVED_FLAG,
  wrkitem.CONTAINER_ID
from BLC_SNDBX_WRKFLW_DEPLOY dep
  inner join BLC_SNDBX_WRKFLW_EVENT ev on ev.WORKFLOW_DEPLOY_ID = dep.WORKFLOW_DEPLOYMENT_ID
  inner join BLC_SNDBX_WRKFLW_EVENT_ITEM evitem on evitem.EVENT_ID = ev.WORKFLOW_EVENT_ID
  inner join BLC_SNDBX_WRKFLW_ITEM wrkitem on wrkitem.WRKFLW_SNDBX_ITEM_ID = evitem.ITEM_ID
where
  (dep.STATUS = 'PENDING' or dep.STATUS = 'IN_PROCESS')

2) Once you have the BLC_SNDBX_WRKFLW_DEPLOY.WORKFLOW_DEPLOYMENT_ID id's for the stuck deployments, you can insert them into the BLC_PROCESS_STATUS table.  This will trigger the system to continue processing that deployment. 

For example, if we had 2 deployments in PENDING or IN_PROCESS, we would take the IDs (let's say our result set had 15880 and 15931 for WORKFLOW_DEPLOYMENT_ID) and insert two BLC_PROCESS_STATUS rows.  When manually inserting into the BLC_PROCESS_STATUS table, we want to use negative PROCESS_STATUS_ID's that do not already exist.  We use negative ID's so that we don't collide with the auto-increment value produced by Hibernate.  In this example we will try to use -100 and -101.  If they don't exist we can safely use them.

select * from BLC_PROCESS_STATUS where PROCESS_STATUS_ID in (-100, -101);

Once we verify it we can insert with those ID's, we will add the necessary BLC_PROCESS_STATUS rows.  NOTE:  The WORKFLOW_DEPLOYMENT_ID value from the BLC_SNDBX_WRKFLW_DEPLOY query will map to the BLC_PROCESS_STATUS.EXECUTION_ID column in our insert statement.

INSERT INTO blc_process_status (PROCESS_STATUS_ID, DESCRIPTION, EXECUTION_ID, EXECUTION_TYPE, FRIENDLY_NAME, IS_ARCHIVED, IS_PAUSED, LAST_UPDATED, NAME, PERCENT_COMPLETE, PROCESS_ID, VERSION) VALUES (-100, NULL, 15880, 'APPROVAL_PROCESS_TYPE', NULL, NULL, NULL, '2018-08-06 08:40:21:000', 'SCHEDULED_DEPLOY', NULL, -100, 1);
INSERT INTO blc_process_status (PROCESS_STATUS_ID, DESCRIPTION, EXECUTION_ID, EXECUTION_TYPE, FRIENDLY_NAME, IS_ARCHIVED, IS_PAUSED, LAST_UPDATED, NAME, PERCENT_COMPLETE, PROCESS_ID, VERSION) VALUES (-101, NULL, 15931, 'APPROVAL_PROCESS_TYPE', NULL, NULL, NULL, '2018-08-06 08:40:21:000', 'SCHEDULED_DEPLOY', NULL, -101, 1);

Once the BLC_PROCESS_STATUS rows have been inserted, the system should be able to continue with the deployment.