What we learned from the CrowdStrike fiasco

The CrowdStrike incident of July 2024 has brought to light critical issues in cybersecurity and software management.

This article will not only delve into the details of what happened but also provide practical advice for small businesses. By learning from this incident, businesses can take steps to protect their digital infrastructures and better prepare for future threats.

What happened?

On July 19, 2024, CrowdStrike released a routine content update for its Falcon security software that accidentally caused Windows computers running it to crash. Microsoft later estimated that about 8.5 million Windows devices were affected. CrowdStrike withdrew the faulty update quickly and changed its processes to reduce the chance of a repeat. Many affected computers were back up and running within a few hours, but machines stuck in a crash loop often needed hands-on repair, so full recovery took some organizations days.

The incident had widespread repercussions across sectors worldwide, including healthcare, airlines, and financial institutions.

Medical facilities worldwide experienced significant disruptions. Non-urgent surgeries and medical procedures were canceled, and patient care was delayed. The inability to access electronic medical records and critical systems posed serious risks to patient safety and care continuity. The outage also hit the aviation sector hard, with over 2,800 flights canceled or delayed.

Banks and financial services experienced disruptions, affecting digital services and customer transactions.

How did we let this happen?

The CrowdStrike incident highlighted the risks of relying heavily on a single cybersecurity firm. As a leader in endpoint security, CrowdStrike’s software is used by many Fortune 500 companies, which amplified the impact of the outage.

This reliance didn’t develop overnight. CrowdStrike’s reputation as a leader in endpoint detection and response (EDR) has been driven by its innovative approach and its strong partnerships with resellers and managed security service providers (MSSPs).

Many organizations prefer to consolidate their cybersecurity tools with a single provider to simplify operations, which can lead to dependency on that vendor’s services. CrowdStrike’s rich suite of cybersecurity tools makes it an attractive option for such companies.

Additionally, the growing need for robust cybersecurity, driven by sophisticated cyberattacks and the shift to remote work during the COVID-19 pandemic, has strengthened CrowdStrike’s position as a market leader.

Microsoft’s role and response

Microsoft was not responsible for the flaw in the CrowdStrike update, but since the issue affected its Windows operating systems, the company had to act swiftly to assist its customers and mitigate the impact. Microsoft acknowledged the problem and collaborated closely with CrowdStrike and other industry partners to provide immediate support to affected users.

They provided technical guidance and support to help safely bring disrupted systems back online. Microsoft deployed hundreds of engineers and experts to work directly with customers, assisting in restoring services.

They also collaborated with other cloud providers like Google Cloud Platform (GCP) and Amazon Web Services (AWS) to share information and coordinate efforts.

Additionally, Microsoft released a recovery tool to assist in booting and repairing affected systems and posted manual remediation documentation and scripts. They kept customers informed through the Azure Status Dashboard and other communication channels.

Although Microsoft was initially blamed by some as the main culprit, its proactive and collaborative approach was essential in addressing the incident’s immediate challenges and supporting the recovery process.

Lessons learned

The CrowdStrike incident highlighted several critical lessons for the cybersecurity industry and IT infrastructure management. Here are the key takeaways:

1. Over-reliance on single providers

One of the major lessons from the CrowdStrike outage is the risk of relying too heavily on a single cybersecurity provider. The incident showed how a fault in one company’s update process could have a massive global impact.

People were unable to complete everyday tasks, from buying plane tickets to filling out PDF forms for job applications to receiving care in hospitals. For many, daily life ground to a halt.

Many organizations, including Fortune 500 companies, experienced significant disruptions because they depended on CrowdStrike’s Falcon platform for endpoint security. This event underscores the importance of diversifying cybersecurity solutions to avoid a single point of failure.

This incident is still relevant for small businesses that don’t use major enterprise software, because it is a reminder not to put all your trust in a single tool or service. Diversifying your tech and security tools and services can provide an extra layer of protection and prevent one problem from disrupting your entire business.

2. Importance of robust testing procedures

The CrowdStrike outage highlighted the critical need for rigorous testing before software updates are deployed. According to CrowdStrike’s own post-incident review, the faulty update passed the company’s automated validation because of a bug in the validation step, so it reached customers’ machines without being caught and caused widespread disruptions.

Comprehensive testing strategies must be implemented to prevent such issues. This includes automated, manual, and regression tests conducted in environments that closely mimic real-world production settings.

Additionally, incorporating black box pen-testing can play a crucial role in identifying potential vulnerabilities. This method involves simulating attacks from an external perspective, offering valuable insights into how attackers might exploit weaknesses in the system.

While your business may not be using CrowdStrike or other enterprise-level software, it’s still important to test how new software or updates may affect your systems. Even something as small as a WordPress plugin update can take a website down, so trying updates in a sandbox environment first can help you catch problems before your customers do.
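
One way to picture the sandbox approach is as a simple gate: apply the update to a staging copy first, run a few health checks, and only touch the live site if every check passes. The Python sketch below is illustrative only. The names `run_checks` and `promote_if_healthy`, and the idea that an update can be applied by calling a plain function, are assumptions made for this example and not the API of WordPress or any particular tool.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named health check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failure, not a pass.
            ok = False
        if not ok:
            failed.append(name)
    return failed


def promote_if_healthy(
    apply_to_staging: Callable[[], None],
    apply_to_production: Callable[[], None],
    staging_checks: Dict[str, Callable[[], bool]],
) -> List[str]:
    """Apply an update to staging first; update production only if all checks pass.

    Returns the list of failed check names (an empty list means the update
    was promoted to production).
    """
    apply_to_staging()
    failed = run_checks(staging_checks)
    if not failed:
        apply_to_production()
    return failed
```

In practice the checks might load the home page, submit a test order, or confirm that the login form still works. The point of the design is that production is never touched until staging has proven the update is safe.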

3. Enhancing IT resilience

The incident emphasized the importance of designing IT infrastructure with resilience in mind. Prioritizing redundancy, failover systems, and disaster recovery plans can ensure business continuity in the face of unforeseen events.

A proactive approach to infrastructure resilience minimizes the impact of outages and maintains operational stability.

For small businesses, this incident highlights the importance of having backup systems and plans in place. Ensuring your IT infrastructure is resilient can help your business stay operational even when unexpected issues arise.
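
As a rough illustration of what failover means in practice, the Python sketch below tries a primary service first and falls back to backups in order. The function name `call_with_failover` and the stand-in `primary` and `backup` functions are hypothetical, chosen only for this example. Real failover also needs timeouts, monitoring, and recovery procedures that are tested regularly.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(providers: List[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, callable) provider in order and return the first success.

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors = []
    for name, provider in providers:
        try:
            return name, provider()
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Hypothetical stand-ins for a primary service and its backup.
    def primary() -> str:
        raise ConnectionError("primary is down")

    def backup() -> str:
        return "served from backup"

    used, result = call_with_failover([("primary", primary), ("backup", backup)])
    print(used, result)
```

The same pattern applies beyond code: a second internet connection, an offline copy of key documents, or a manual process for taking orders all play the role of the backup provider.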

4. Kernel-level access concerns

The CrowdStrike incident reignited debates about granting third-party software kernel-level access to operating systems. While such access is often necessary for robust security measures, it poses significant risks if not managed properly.

Organizations should carefully evaluate the need for kernel-level access and consider alternatives that balance security requirements with system stability.

For small businesses, this incident highlights the risks of giving third-party software deep access to your computer systems. While it can be necessary for strong security, it’s important to carefully consider if such access is truly needed and to explore safer alternatives to maintain system stability.

5. Continuous improvement of incident response plans

Developing and regularly updating incident response plans is crucial. This includes conducting post-incident reviews to identify weaknesses and areas for improvement.

Final lessons

The CrowdStrike incident has underscored the urgent need for enhanced cybersecurity measures and effective risk management.

This incident should serve as a wake-up call for organizations to improve their security practices and increase the resilience of their digital systems worldwide. Continuous learning from past incidents, adapting to new challenges, and investing in ongoing education and training for teams are essential steps organizations must take to stay ahead of emerging threats.

For small businesses, the CrowdStrike incident emphasizes the importance of taking cybersecurity seriously and being proactive in managing risks. It serves as a reminder to continuously improve security practices, build resilient digital systems, and invest in regular training for your team. By learning from incidents like this, small businesses can better protect themselves against emerging threats and ensure they remain prepared for any future challenges.

by NameCheap