Alerts

Alert: Windows Event Log

  • These come in a few forms.

Event Description: The device, \Device\Harddisk0\DR0, has a bad block.

  • This means this drive is failing and must be replaced.
  1. First check whether the system is under warranty. If it is, make a warranty claim.
  2. If it is not under warranty, first figure out which drive has failed
    • It can be difficult to tell which drive it is. Use Bad Block Error Drive Correlation to help determine it. The WMIC method is the most effective.
  3. If it is a server, this is easier: use OMSA to identify the drive (or run omreport storage pdisk controller=0 via SC).
  4. Note the current drive and find an appropriate replacement (likely the same as before).
  5. Put the following details into the ticket and set it to ‘Quote Request’ status:
    • What happened
    • The old drive model, physical size (3.5, 2.5, 1.8, etc.), virtual size (e.g. 512GB), and connection type (SATA, SSD, NVMe, M.2, etc.)
    • The new drive model, physical size (3.5, 2.5, 1.8, etc.), virtual size (e.g. 512GB), and connection type (SATA, SSD, NVMe, M.2, etc.)

Event Description: The driver detected a controller error on \Device\Harddisk2\DR2.

    • These can be very serious or quite benign. If this is a RAID controller or a permanently attached device, it is cause for alarm and troubleshooting is required.
    • If the device is a CD-ROM, USB drive, or memory stick, note this in the ticket and set it to completed: “Removable media being ejected, this is expected”.

Event Description: A corruption was discovered in the file system structure on volume F:.

    • This means the drive has filesystem corruption, generally from turning the PC off improperly but sometimes from a bad drive. Run a chkdsk.
    • If you run chkdsk from Sentinel or Screen Connect, it will hang waiting for a Y/N input. You can get around this with the command “fsutil dirty set c:” (or D:, F:, etc.), which marks the filesystem as dirty and forces a disk check on the next reboot.
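The workaround above can be sketched as a small helper that builds the command string for whichever volume the alert names. This is only an illustration: the function name dirty_bit_command is hypothetical, and the sketch only produces the text to paste into the remote shell, it does not run anything.

```python
import re


def dirty_bit_command(drive: str) -> str:
    """Build the fsutil command that marks a volume dirty.

    Accepts forms such as "f", "F:" or "F:\\" and normalises them
    to a single upper-case drive letter followed by a colon.
    """
    letter = drive.strip().rstrip(":\\").upper()
    if not re.fullmatch(r"[A-Z]", letter):
        raise ValueError(f"not a drive letter: {drive!r}")
    return f"fsutil dirty set {letter}:"


if __name__ == "__main__":
    # Volume taken from the example event description above.
    print(dirty_bit_command("F:"))
```

Validating the drive letter first avoids sending a malformed command to a remote session where a typo is harder to notice.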

Event Description: Device failed: Physical Disk 0:0:1 Controller 0, Connector 0

Event Description: Virtual disk degraded: Virtual Disk 0 (SIYE-RAID) Controller 0 (PERC 5/i Integrated)

Event Description: Physical device removed: Physical Disk 0:0:1 Controller 0, Connector 0

Event Description: Physical disk offline: Physical Disk 0:0:1 Controller 0, Connector 0

  • All of the above are related. Merge them together into one ticket.
  1. Use OMSA to find a matched replacement (or omreport storage pdisk controller=0).
  2. Note the current drive and find an appropriate replacement (likely the same as before).
  3. Put the following details into the ticket and set it to ‘Quote Request’ status:
    • What happened
    • The old drive model, physical size (3.5, 2.5, 1.8, etc.), virtual size (e.g. 512GB), and connection type (SATA, SSD, NVMe, M.2, etc.)
    • The new drive model, physical size (3.5, 2.5, 1.8, etc.), virtual size (e.g. 512GB), and connection type (SATA, SSD, NVMe, M.2, etc.)

Alert: Clock Drift

  1. See Windows Server Time Issues for troubleshooting tips

Alert: Patch Status v2 on SERVER is Failed

These alerts can have a number of causes.

“Status of Patch Management Engine: [P501] Cannot connect to Patch Management Engine (PME) on N-central Agent.”

  • Follow this guide: [[1]] and download the latest PME client.

Alert: Disk on [SYSTEM] is failed

  • This generally means a drive is running out of free space. Look for a line like “Disk Usage: 82.00 %”. We trigger an alert above 80%.
  1. Purge Windows Temp files
  2. Run the Windows Disk Cleaner for normal files
  3. Check if the system has multiple user profiles
    • If there is a former user’s profile, have the client’s POC confirm in writing that the former staff member’s profile can be erased. The confirmation must come from their email and name both the system and the profile to be erased.
  4. Use WizTree to see if other directories are using a large amount of space
  5. If a user’s directory is large (downloads, etc) let them purge things. DO NOT UNDER ANY CIRCUMSTANCES REMOVE A USER’S FILES. You can watch, they must perform the action.
    • If they can’t erase things, that’s okay. Offer to quote them a bigger hard drive to solve their space issues.
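To confirm whether a drive is still over the alert threshold after cleanup, the percentage can be recomputed by hand. The following is a minimal sketch, assuming the alert compares used space against total space; the function names and the 80% constant are illustrative and are not taken from the monitoring tool itself.

```python
import shutil

# From the runbook: the alert triggers above 80% usage.
ALERT_THRESHOLD_PERCENT = 80.0


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total, rounded to 2 places."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(used_bytes / total_bytes * 100, 2)


def is_over_threshold(percent: float,
                      threshold: float = ALERT_THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the alert threshold."""
    return percent > threshold


if __name__ == "__main__":
    # "/" resolves to the current drive root; pass "C:\\" explicitly on Windows.
    usage = shutil.disk_usage("/")
    pct = percent_used(usage.total, usage.used)
    print(f"Disk Usage: {pct:.2f} %  over threshold: {is_over_threshold(pct)}")
```

Running this before and after each cleanup step shows how much space the step actually recovered.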

Alert: Reboot Required

  • Reboot the system after hours
  • These will be the responsibility of the On Call technician. Assign these to yourself and set to ‘scheduled’ and note in the ticket when you will perform the work.
  • DNR List — These systems should not be rebooted without advanced knowledge of the system
    1. VCC-STM-01
  • As we begin doing this routinely, we need a ‘post boot checklist’ created for each system to ensure all systems and services return to operation properly

Alert: Uptime on [system]

  • We schedule a reboot 4 hours out (14400 seconds).
  1. In SC, send ‘shutdown /r /t 14400’
  2. Close the ticket whether the system is on or off
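The /t value is the delay in seconds, so 4 hours is 4 × 3600 = 14400. As a small sketch for working out other delays, the helper below (a hypothetical name, not part of any tool mentioned here) converts hours to seconds and builds the command string; it does not execute anything.

```python
def delayed_reboot_command(hours: float) -> str:
    """Build a Windows 'shutdown /r /t <seconds>' command for a delay in hours."""
    if hours < 0:
        raise ValueError("hours must not be negative")
    seconds = int(hours * 3600)  # /t takes whole seconds
    return f"shutdown /r /t {seconds}"


if __name__ == "__main__":
    # The 4-hour delay used in this runbook.
    print(delayed_reboot_command(4))
```

If a scheduled reboot needs to be cancelled before the timer expires, ‘shutdown /a’ aborts it.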

Alert: Services Stopped

  • Usually alerted while the PC was shutting down, because the PC is slow.

Alert: System is offline

  • Usually alerted while the PC was shutting down, because the PC is slow.

AWS Alerts

Amazon CloudWatch Alarm

 Alarm Details:
 - Name:                       <EZM RDS SERVER NAME>-RDS-FreeStorageSpace
 - Description:                Free space on ezvenue-production-mysql8 RDS is below 50Gb. Increase storage space.
  • Log into the AWS console.
  • Go to RDS, then DB Instances.
  • Click the DB instance name from the alert.
  • Click the Modify button in the top-right corner.
    • Scroll down to the storage section and increase the allocated storage by at least 150 GB.
    • Scroll down to the bottom and click Continue.
      • On the next page, you’ll be prompted to choose a maintenance window or to apply the change immediately.
      • Click the Modify DB Instance button to apply the change.
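For reference, the same change can be expressed as the parameters one would pass to the RDS API instead of clicking through the console. This is a sketch under assumptions: the helper name and the "example-db" identifier are hypothetical, the current size must be read from the console first, and the block only builds the parameter dictionary. The key names match the boto3 RDS modify_db_instance parameters, but the call itself is left as a comment and is not an established part of this procedure.

```python
MIN_EXTRA_GB = 150  # from the runbook: add at least 150 GB


def storage_increase_request(instance_id: str,
                             current_gb: int,
                             extra_gb: int = MIN_EXTRA_GB,
                             apply_immediately: bool = False) -> dict:
    """Build the parameters for an RDS storage increase."""
    if extra_gb < MIN_EXTRA_GB:
        raise ValueError(f"add at least {MIN_EXTRA_GB} GB")
    return {
        "DBInstanceIdentifier": instance_id,
        "AllocatedStorage": current_gb + extra_gb,  # new total size
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


if __name__ == "__main__":
    params = storage_increase_request("example-db", current_gb=500)
    print(params)
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("rds").modify_db_instance(**params)
```

The apply_immediately flag mirrors the console prompt that asks whether to wait for a maintenance window or apply the change right away.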

Let’s Encrypt certificate Expiration

User email address is marked as invalid

Azure: Synchronization Errors in your Directory

Azure: Password Hash Synchronization heartbeat was skipped

  • Often the resolution email doesn’t come or doesn’t auto close the ticket. Check to see if it’s still an issue.
  1. Log into the Microsoft Admin Panel and view Entra Connect Health.
  2. Look under the Status column. If it shows “Healthy”, everything is good and you may close the ticket.
    1. If it shows “no access”, click on the “Sync Error” section. If it shows 0 for all categories, everything is good and you may close the ticket.
  • If it still shows an alert, try restarting the service.
  1. Click Start, click Run, type Services.msc, and then click OK.
  2. Locate: Microsoft Entra Sync, right-click it, and then click Restart.
  • If this still doesn’t resolve the status, check here for further troubleshooting steps

Quota low on space

  • First, check for redundant backups on STM-VBR-01 and make sure it’s only keeping the number of copies we need. Second, look for ways to reduce the storage needed, i.e. reduce the amount of data being backed up.