Root Cause: After reviewing our monitoring and database logs, the dev team determined that one of the schedule tables had been growing slowly over the last 30 days, causing a gradual decay in performance. Table growth isn't unusual, but this particular table is a maintained table that requires a predictable record count to satisfy schedule search performance requirements. The table, and the temporary tables derived from it, reached a critical size at which query performance dropped drastically, pushing response times to 20-60 s, well outside acceptable limits. Normal response times for the search endpoint are expected to be at or below 500-750 ms.
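For illustration only, the kind of record-count check involved might look like the sketch below. The table name `schedule_slots`, the row ceiling, and the use of an in-memory SQLite database as a stand-in are all assumptions; the actual schema, database engine, and thresholds are not part of this report.

```python
import sqlite3

EXPECTED_MAX_ROWS = 50_000  # hypothetical ceiling the maintained table is tuned for

def check_table_size(conn, table="schedule_slots"):
    """Compare the table's current row count against the expected ceiling."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if count > EXPECTED_MAX_ROWS:
        print(f"WARNING: {table} holds {count:,} rows, above the {EXPECTED_MAX_ROWS:,} "
              "ceiling the schedule-search queries are tuned for")
    else:
        print(f"{table} row count ({count:,}) is within the expected range")

# Minimal demo against an in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule_slots (id INTEGER PRIMARY KEY, starts_at TEXT)")
check_table_size(conn)
```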
Impact: The schedule page and schedule search were unresponsive or very slow to load during the incident window, impacting operational teams' and customers' ability to view the schedule or enroll in new lessons. The issue was reported primarily by internal users.
Response and Recovery: The dev team reviewed the queries behind the lagging endpoints, identified the data growth problem, and immediately began clearing the unnecessary records from the table. Given the volume and size of the excess records, this process took about 30 minutes to complete, as it required careful planning and scripting so as not to cause further disruption. The cleanup had an immediate positive impact on query and endpoint performance, bringing both back within acceptable ranges.
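A minimal sketch of the batched-cleanup approach described above is shown below; it is not the team's actual script. The table name, the `is_stale` flag used as the deletion predicate, the batch size, and the SQLite stand-in are all hypothetical. Deleting in small batches with a commit between batches keeps each transaction short so live schedule queries are not blocked for the full duration of the cleanup.

```python
import sqlite3
import time

BATCH_SIZE = 5_000  # hypothetical: small batches keep each transaction short

def purge_stale_rows(conn):
    """Delete flagged rows in small batches, committing between batches."""
    removed = 0
    while True:
        cur = conn.execute(
            "DELETE FROM schedule_slots WHERE id IN "
            "(SELECT id FROM schedule_slots WHERE is_stale = 1 LIMIT ?)",
            (BATCH_SIZE,),
        )
        conn.commit()          # release locks so live queries are not blocked
        removed += cur.rowcount
        if cur.rowcount < BATCH_SIZE:
            return removed
        time.sleep(0.5)        # brief pause between batches to yield to production traffic

# Minimal demo: half of the rows are flagged as stale and get removed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule_slots (id INTEGER PRIMARY KEY, is_stale INTEGER)")
conn.executemany("INSERT INTO schedule_slots (is_stale) VALUES (?)",
                 [(i % 2,) for i in range(20)])
print(f"removed {purge_stale_rows(conn)} stale rows")
```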
Next Steps: We identified the source of the table bloat and have already released a fix. We are monitoring the table's record count to confirm the fix is effective. We also identified another potentially problematic query; a fix is pending and will be deployed by EOD 4/15/25. In addition, we identified two new database monitoring points, which have been created with alerting. We are also investigating further monitoring options to gain better visibility into, and earlier warning of, endpoint latency issues.
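As a rough sketch of what one such monitoring point could look like, the snippet below samples the table's row count on a schedule and raises a warning when day-over-day growth exceeds a limit, catching slow growth long before it affects search latency. The table names, the growth limit, the history table, and the use of a log warning in place of the real alerting system are assumptions, not a description of the team's actual tooling.

```python
import logging
import sqlite3

DAILY_GROWTH_LIMIT = 500   # hypothetical: flag unexpected growth well before it affects queries
logger = logging.getLogger("schedule_table_monitor")

def record_and_check_growth(conn):
    """Intended to run on a schedule: store the current row count and compare to the last sample."""
    conn.execute("CREATE TABLE IF NOT EXISTS row_count_history ("
                 "observed_at TEXT DEFAULT CURRENT_TIMESTAMP, row_count INTEGER)")
    (count,) = conn.execute("SELECT COUNT(*) FROM schedule_slots").fetchone()
    previous = conn.execute(
        "SELECT row_count FROM row_count_history ORDER BY rowid DESC LIMIT 1"
    ).fetchone()
    conn.execute("INSERT INTO row_count_history (row_count) VALUES (?)", (count,))
    conn.commit()
    if previous is not None and count - previous[0] > DAILY_GROWTH_LIMIT:
        # In production this would feed the alerting system; a warning log stands in here.
        logger.warning("schedule_slots grew by %d rows since the last check (limit %d)",
                       count - previous[0], DAILY_GROWTH_LIMIT)

# Minimal demo against an in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule_slots (id INTEGER PRIMARY KEY)")
record_and_check_growth(conn)   # first sample; later runs compare against it
```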