The CrowdStrike Outage: What Happened and Lessons Learned

CrowdStrike is a leading cybersecurity company that provides cloud-based endpoint protection, threat intelligence, and incident response services to companies of all sizes worldwide. On July 19, 2024, CrowdStrike released a flawed update to its Falcon Endpoint Detection and Response (EDR) software. The update caused affected Windows devices to crash and display Microsoft’s “Blue Screen of Death” (BSOD).

In all, roughly 8.5 million Windows devices were affected worldwide, disrupting sectors as diverse as airlines, finance, and healthcare.

How it happened

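Falcon operates deep within the operating system to monitor for malicious activity on corporate workstations. On Windows, the sensor runs at the kernel level, so a fault in it can crash the entire machine rather than a single application.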

According to CrowdStrike’s Root Cause Analysis (RCA) report:

“The new IPC Template Type defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against. This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field… The attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash.”

Because the Falcon sensor works differently on macOS and Linux systems, the update only affected Windows systems. Windows systems crashed when they tried to process the update, resulting in global outages.
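
The parameter count mismatch described in the RCA can be illustrated with a short Python sketch. This is an analogy only: every name below (TEMPLATE_FIELD_COUNT, match_unchecked, match_checked) is hypothetical and does not come from CrowdStrike's code. Where the real sensor performed an out-of-bounds memory read, Python raises an IndexError instead.

```python
from typing import List, Optional

# Illustrative analogy only; all names are hypothetical, not CrowdStrike's code.
TEMPLATE_FIELD_COUNT = 21  # fields the template type declares
SUPPLIED_INPUT_COUNT = 20  # values the integration code actually supplied


def match_unchecked(inputs: List[str], field_index: int) -> str:
    """Read a field without checking the input length first."""
    # In a memory-unsafe language this read could run past the end of the
    # array. Python raises IndexError instead of reading adjacent memory.
    return inputs[field_index]


def match_checked(
    inputs: List[str], expected_count: int, field_index: int
) -> Optional[str]:
    """Validate the input count and the index before reading."""
    if len(inputs) != expected_count:
        return None  # reject the content instead of crashing
    if not 0 <= field_index < len(inputs):
        return None
    return inputs[field_index]


inputs = [f"value-{i}" for i in range(SUPPLIED_INPUT_COUNT)]
last_field = TEMPLATE_FIELD_COUNT - 1  # index 20, i.e. the 21st value

try:
    match_unchecked(inputs, last_field)
    unchecked_result = "read succeeded"
except IndexError:
    unchecked_result = "out-of-bounds read"

checked_result = match_checked(inputs, TEMPLATE_FIELD_COUNT, last_field)
print(unchecked_result, checked_result)  # out-of-bounds read None
```

The checked version rejects the 20-value input up front, which is the kind of input-count validation the RCA says was missing.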

Worldwide impact

The outage affected large and small businesses worldwide. Organizations that rely on Windows-based systems, including airlines, healthcare providers, and financial institutions, were particularly vulnerable.

Although CrowdStrike released a fix within 90 minutes of the incident, many machines that had already crashed could not boot far enough to receive it. IT staff at affected companies had to perform the labor-intensive task of manually booting each system into Safe Mode or the Windows Recovery Environment and deleting the offending Channel File 291. In some cases, they needed physical access to the devices, and in others, they were slowed by the need to locate unique encryption recovery keys (for example, on devices encrypted with BitLocker).

Repercussions of the outage

The road back to normal has not been easy for many of the affected companies. Experts estimate the damage caused by the outage will cost billions of dollars.

  • Airlines, including Delta, United, American, and KLM, had to cancel hundreds of flights until their scheduling systems were restored.
  • The outage also affected public transit systems in Chicago, New York City, and Washington, D.C.
  • Online banking systems and payment platforms around the world were affected by the outage.
  • Appointment systems for hospitals and clinics were disrupted.
  • CrowdStrike shareholders filed a class action lawsuit alleging CrowdStrike made misleading statements about the adequacy of its testing procedures.
  • In addition, Delta Air Lines said it would seek damages from CrowdStrike, saying its losses amounted to $500 million.

What is CrowdStrike doing to prevent such an occurrence in the future?

In its RCA, CrowdStrike outlined several problems with the rollout and offered several mitigation steps to prevent this problem from occurring again. Among them:

  • All updates will be tested internally and implemented in phases.
  • CrowdStrike customers can choose when they update to the newest version.
  • CrowdStrike will provide more testing, including validating the number of expected inputs and preventing out-of-bounds access in the sensors.
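
A phased rollout of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not CrowdStrike's deployment system: the function phased_rollout, the ring names, and the simulated_deploy health check are all hypothetical.

```python
from typing import Callable, List, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy ring by ring and halt if a ring's failure rate is too high.

    Returns the hosts that were updated successfully.
    """
    updated: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt here: later rings never receive the update
    return updated


# Hypothetical rings, ordered from smallest blast radius to largest.
RINGS = [
    ["canary-1", "canary-2"],
    ["early-1", "early-2", "early-3"],
    ["prod-1", "prod-2", "prod-3", "prod-4"],
]


def simulated_deploy(host: str) -> bool:
    """Stand-in for a real deploy plus health check; one host fails."""
    return host != "early-2"


updated_hosts = phased_rollout(RINGS, simulated_deploy)
print(updated_hosts)  # ['canary-1', 'canary-2', 'early-1', 'early-3']
```

Because the failure in the second ring stops the rollout, none of the production hosts receive the faulty update. A real system would also need a rollback path for the hosts that were already updated.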

How can businesses better protect themselves from these types of outages?

It is important to remember that the CrowdStrike outage was not caused by a cybercriminal attack. Instead, it represented a failure of people, processes, and technology—both on the part of CrowdStrike and the affected companies. The CrowdStrike Falcon software remains a solid deterrent to cybercrime and hacking attempts.

While CrowdStrike has taken the initiative to fix the gaps in its software release processes, businesses can also learn from the experience and tighten their own processes:

  • Mitigate system outages by having manual workarounds in place. There was a time when nothing was digital, and yet businesses thrived. Companies should establish processes for business continuity in the event of a major system failure.
  • Test system updates before installing them in production. As good as software development companies are, they can still make mistakes. While you always want your edge and intrusion detection systems up to date, you also want to know whether an update will cause your systems to fail. All mission-critical software updates should be tested in a staging environment before being deployed to production.
  • Be prepared. Critical business systems fail for many reasons, not all due to malicious cyberattacks. All businesses should anticipate a failure and be prepared to recover from it. This preparation includes regular backups, redundant systems, and regular practice drills to ensure the company can restore its data from backups.
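
The restore drill mentioned in the last point can be partly automated. The sketch below is a simplified illustration that assumes a backup is a plain file copy; real backup products have their own restore and verification tooling. The function names restore_drill and sha256_of and the sample file orders.csv are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up a file, restore it elsewhere, and confirm the contents match."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    restore_dir.mkdir(parents=True, exist_ok=True)
    backup_copy = Path(shutil.copy2(source, backup_dir / source.name))
    restored_copy = Path(shutil.copy2(backup_copy, restore_dir / source.name))
    return sha256_of(source) == sha256_of(restored_copy)


# Run the drill against a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "orders.csv"
    original.write_text("id,amount\n1,100\n")
    drill_passed = restore_drill(original, root / "backup", root / "restore")

print(drill_passed)  # True
```

Comparing checksums of the original and the restored copy confirms that the backup can actually be read back, which is the point of a practice drill.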

ArcherPoint can help

If you need help developing your people, processes, and technology to ensure business continuity during a crisis, ask the experts at ArcherPoint.

We offer Managed IT Services that include remote monitoring of your infrastructure, setting up high availability solutions, endpoint and infrastructure security management, and data backup and disaster recovery services.
