Ewwwh, this was a nasty little bug. Let me describe the title … I had to update all our database data and schema to use GMT/UTC as the baseline; accounts and clients would then register which timezone they are in, and our system would convert dates and times from GMT/UTC to the registered timezone.
To support this, we had to adjust the times on the servers, update the database data, and more. The bug itself: when I updated the schedules for all my SQL Server Agent jobs, the new schedules did not take effect.
- Issue: Old job schedules were still active (as visible in the Job Activity Monitor) even though the schedules said otherwise
- Takeaway: Run through all changed jobs in the Job Activity Monitor to make sure the Next Run Time reflects the new date/time.
- How to Fix: Go through each job schedule, disable the schedule, commit all dialogs, then go back in and re-enable it. This forces the Agent to pick up the new date/time.
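If you have a lot of jobs, the same toggle can be scripted instead of clicked through. Here's a hedged sketch using SQL Server 2005's `msdb.dbo.sp_update_schedule`; the schedule name is hypothetical, and I haven't verified that this reproduces the dialog toggle in every case:

```sql
-- Force the Agent to recompute Next Run Time by toggling a schedule off and back on.
-- 'NightlyCleanup' is a hypothetical schedule name; substitute your own.
USE msdb;

EXEC dbo.sp_update_schedule
    @name    = N'NightlyCleanup',
    @enabled = 0;   -- disable (equivalent to unchecking Enabled in the dialog)

EXEC dbo.sp_update_schedule
    @name    = N'NightlyCleanup',
    @enabled = 1;   -- re-enable; the Agent should recalculate the next run date/time
```

Wrapping this in a cursor over your changed schedules would beat clicking through 30 sets of dialogs.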
Let me describe what we did and didn't do, and how to watch out for it. Our environment is set up as follows (some details are irrelevant, so I will keep them basic):
- Windows Server 2008
- SQL Server Enterprise Edition 2005
- Lots of RAM and HDD (SAN)
- Windows and SQL Server clustered
We also have the following environments:
- Prod – Production (detailed above)
- QA – Quality Assurance (same as above minus clustering)
- Dev – Development (same as above minus clustering)
Well, we made all the changes in Dev, rolled out to QA and all tests (eventually) succeeded. Then came time for rollout. We knew that clustering was the missing element in Dev/QA so we had to modify our rollout procedures to take that into account and make sure proper testing and checks were in place to compensate.
Here’s the kicker. We had some jobs running at 3:00 AM (MDT) that needed to be adjusted to 9:00 AM (UTC). We followed these steps for the rollout (which ran from 12:00 AM to 5:00 AM MDT):
- Ran full backups in the morning hours of the rollout day
- Ran differential backups right before the rollout
- Turned off the SQL Agent so any jobs that capture snapshots or make adjustments were all disabled
- Made all adjustments to server times, schema, and data
- Updated the SQL Agent schedules to run at the adjusted times, so that resources are consumed in our off-peak hours according to UTC rather than the former MDT
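That last step can also be done in T-SQL rather than through the dialogs. A sketch, again assuming `msdb.dbo.sp_update_schedule` and a hypothetical schedule name (`@active_start_time` is an integer encoded as HHMMSS):

```sql
-- Move a schedule from its old 3:00 AM MDT slot to 9:00 AM,
-- which is the same moment once the server baseline is UTC (MDT = UTC-6).
-- 'OffPeakMaintenance' is a hypothetical schedule name; substitute your own.
EXEC msdb.dbo.sp_update_schedule
    @name              = N'OffPeakMaintenance',
    @active_start_time = 090000;   -- HHMMSS: 09:00:00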
Alright, steps 3 and 4 are what bit us in the butt. In QA and Dev, on non-clustered servers, disabling the SQL Agent service made the Agent completely unavailable from Enterprise Manager; we couldn’t update schedules or anything else. We had to turn the SQL Agent back on and then update everything. However, in production, where the server is clustered, taking the SQL Agent offline in the Cluster Administrator doesn’t completely shut it down. It was still accessible from Enterprise Manager.
We recognized this and took advantage of the opportunity to update the schedules while we were waiting for the data updates (hours of time). That way we wouldn’t lose precious minutes running through 30 job schedules and getting them all updated before any of them ran.
Well, we recognized later that some of the jobs were still running on the old scheduled time. Looking in the Job Activity Monitor, the Next Run date/time reflected the old schedule!
For example, a query that removes old historical records, previously scheduled for 2:00 AM (MDT), was now supposed to run at 8:00 AM (UTC). The Next Run column in the Job Activity Monitor still said 2:00 AM. So I took the SQL Agent offline in the Cluster Administrator and brought it back online, thinking the schedules were being cached. No go. Same old schedules.
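Rather than eyeballing the Job Activity Monitor job by job, you can read the next-run values the Agent has recorded straight out of msdb. A sketch against the standard tables (the int columns encode the date as YYYYMMDD and the time as HHMMSS):

```sql
-- List every job's recorded next run date/time so stale schedules stand out.
SELECT  j.name              AS job_name,
        js.next_run_date,   -- int, YYYYMMDD; 0 if no run is scheduled
        js.next_run_time    -- int, HHMMSS
FROM    msdb.dbo.sysjobs         AS j
JOIN    msdb.dbo.sysjobschedules AS js
        ON js.job_id = j.job_id
ORDER BY j.name;
```

Any row still showing the pre-rollout time (e.g. 20000 for 2:00 AM) is a job that needs the disable/re-enable treatment.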
I then decided to disable and re-enable the schedule on one job, refreshed the Job Activity Monitor, and it now reflected the new time. Finally, the workaround. The funny thing is that some of the jobs had already picked up the new schedule while the majority of them hadn’t. I can’t recall any tweak I made that would explain the difference, but …
What a nasty little bug. It happened all because the SQL Agent was still “online” and available in my production cluster environment. What would I do differently? Not update the schedules until the SQL Agent is fully back online (in the clustered environment).