#186 Business Continuity lessons learnt from CrowdStrike
The ISO Show - A podcast by Blackmores UK - Tuesdays
In July 2024, a logic error in an update to CrowdStrike’s Falcon software caused 8.5 million Windows computers to crash. While a fix was pushed out shortly afterwards, the nature of the error meant that a full recovery of all affected machines took weeks to complete. Many businesses were caught up in the disruption, whether they were affected directly or by proxy through affected suppliers. So, what can businesses learn from this?

Today, Ian Battersby and Steve Mason discuss the aftermath of the CrowdStrike crash, the importance of good business continuity and the actions all businesses should take to ensure they are prepared in the event of an IT incident.

You’ll learn
· What happened following the CrowdStrike crash?
· How long did it take businesses to recover?
· Which ISO management system standards would this impact?
· How can you use your management system to address the effects of an IT incident?
· How would this change your understanding of the needs and expectations of interested parties?
· How do risk assessments factor in where IT incidents are concerned?

Resources
· Isologyhub
· ISO 22301 Business Continuity

In this episode, we talk about:

[00:30] Join the isologyhub – To get access to a suite of ISO related tools, training and templates, simply head on over to isologyhub.com to either sign up or book a demo.

[02:05] Episode summary: Ian Battersby is joined by Steve Mason to discuss the recent CrowdStrike crash, its implications for your management system and the business continuity lessons learned that you can apply ahead of any potential future incidents.

[03:00] What happened following the CrowdStrike crash? – In short, an update to CrowdStrike’s Falcon software brought down computer systems globally. 8.5 million Windows systems (in reality, less than 1% of all Windows systems) were affected as a result of the error. Even so, the damage was felt across key pillars of our societal infrastructure, with hospitals and transport, including trains and airlines, among the worst affected.

[04:45] How long did it take CrowdStrike to issue a fix? – CrowdStrike fixed the issue in about 30 minutes, but this didn’t mean that affected computers were automatically repaired. In many cases applying the fix meant that engineers had to go on site to many different locations, which is both time consuming and costly. Microsoft said that some computers might need as many as 15 reboots to clear the problem. So, a fix that many hoped would solve the issue ended up taking a few weeks to fully resolve, as not everyone has IT or tech support in the field to carry out a manual reboot. A lot of businesses were caught out because they don’t factor this into their recovery time, some assuming that an issue like this is guaranteed to be fixed within 48 hours, which is not something you can promise. You need to be realistic when filling out a Business Impact Analysis (BIA).

[07:55] How do you know in advance if an outage will need physical intervention to resolve? – There is a lesson to be learnt from this most recent issue. Take a look at your current business continuity plans and ask yourself:
· What systems do you use?
· How reliable are the third-party applications that you use?
· If an issue like this were to recur, how would it affect us?
· Do we have the necessary resources to fix it, e.g. staff on site if needed?
Third parties will have a lot of clients, and some may even prioritise those paying for a premium package, so you can’t always count on them for a quick fix.
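One way to capture the answers to those questions is a simple dependency register. The sketch below is a minimal, hypothetical Python example; the service names, fields and recovery figures are illustrative assumptions for this sketch, not requirements from ISO 22301 or details discussed in the episode.

```python
# Minimal, hypothetical dependency register: which third-party services do we
# rely on, how critical are they, and would recovery need someone on site?
from dataclasses import dataclass

@dataclass
class Dependency:
    name: str                     # third-party service (illustrative)
    criticality: str              # "high", "medium" or "low"
    needs_onsite_recovery: bool   # e.g. manual reboots or safe-mode fixes
    planned_recovery_hours: int   # what the BCP currently assumes

REGISTER = [
    Dependency("Endpoint protection agent", "high", True, 48),
    Dependency("Cloud email", "high", False, 4),
    Dependency("Payroll SaaS", "medium", False, 24),
]

def review(register: list[Dependency]) -> None:
    """Flag dependencies where an outage would need physical intervention."""
    for dep in register:
        if dep.criticality == "high" and dep.needs_onsite_recovery:
            print(f"REVIEW: {dep.name} - planned recovery of "
                  f"{dep.planned_recovery_hours}h may be optimistic if "
                  f"engineers must visit each machine.")

review(REGISTER)
```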
[09:10] How does this impact our businesses in terms of our management standards? – When we begin to analyse how this has impacted our management systems, we can’t afford to say ‘We don’t use CrowdStrike, therefore it did not impact us’ – it may have impacted your suppliers or your customers. Even if there was zero impact, lessons can be learned from this event for all companies. Standards that were directly affected by the outage were:
· ISO 22301 – Business Continuity: recovery times (RPO and RTO); BIA; risk assessments
· ISO 27001 – Information Security: risk assessment; likelihood; severity; BCP; ICT readiness
· ISO 20000-1 – IT Service Management: risk assessment of service delivery; service continuity; service availability
Remember, our management systems should reflect reality and not aspiration.

[11:30] How do we use our management systems to navigate a path of corrective action and continual improvement? – First and foremost, an event like this must be raised as an incident; in this case it would no doubt have been a major incident for some companies. The incident will typically be recorded in the company’s system for capturing non-conformities or continual improvement – you could liken this to how ISO 45001 requires you to report accidents and incidents. From the incident a plan can be created, which should include changes to be considered or made to the management system. The incident should also lead to a lessons learned activity to determine where changes and improvements need to be made. All of the standards direct us to understanding the organisation and its context; the key requirement here is to determine the internal and external issues that can impact your management system and prevent it from being effective. Whatever method a company uses for this, perhaps a SWOT and PESTLE, the CrowdStrike/Microsoft outage should be included in that analysis as a threat and/or technical issue.

[15:15] What are the lessons learned from our supply chain? – In many ISO standards, such as ISO 9001 and ISO 27001, there is a requirement to review your suppliers and the effectiveness of the service they’re delivering. So you could send them an e-mail to ask how they dealt with the issue, what actions they took and how long it took to fully restore services. This is a collaborative process that you can factor into your own risk assessments, as you can make a better judgement on future risk levels if you are privy to their recovery plans. Many people still think of that requirement only in relation to goods and products (i.e. has my order been delivered, etc.), but it relates to services such as IT infrastructure as well. You rely on that service, so evaluate how well it’s being delivered.

[17:35] Join the isologyhub and get access to limitless ISO resources – From as little as £99 a month, you can have unlimited access to hundreds of online training courses and achieve certification for completion of courses along the way, which will take you from learner to practitioner to leader in no time. Simply head on over to the isologyhub to sign up or book a demo.

[19:50] Once you have established lessons learnt, what’s next? – The standards provide a logical path to work through.
One of the first steps is to conduct a SWOT and PESTLE, and doing so after a major incident is recommended, as your threats and weaknesses may have changed as a result. Do not simply place all the blame on the third party the incident may have originated from: this is about your response and recovery, and your plans coming into effect to deal with the situation, not about who is at fault. One such finding may be a lack of business continuity plans, in which case implementing aspects of ISO 22301 may be an action to consider. It’s also important to note down any positives from the incident. You may have dealt with something very quickly, communicated the issue effectively and worked with clients to ensure that their level of service was minimally impacted. If a team dealt with a situation particularly well, they should be recognised for that, as it really does go a long way.

[23:55] The importance of revisiting your SWOT and PESTLE: These exercises shouldn’t be a one-time thing. You should revisit them after incidents and after any major changes within the business. Ideally, you should be looking at them in all your meetings, as many actions may need to be escalated to a strategic level. If you’d like to learn how one of our clients embraced SWOT and PESTLE and used it to their advantage, check out episode 53.

[25:20] How has our understanding of the needs and expectations of interested parties changed? – Understanding how the outage has impacted the needs and expectations of interested parties might lead companies to ask questions about the robustness and effectiveness of different parts of the management system:
· Risk assessment
· BIA for BCP
· Recovery plans
· DR plans
· Service continuity

[27:50] What should you be considering with your risk assessments? – Risk assessments, if they follow the traditional methodology, will have likelihood and impact/severity scores, and in light of this outage (or any event) those scores should be updated. If a company has set the likelihood as ‘once every 5 years’, it should seriously consider changing this to ‘once every 6 months’ or ‘once every year’ to understand whether this poses any new risks to the business. The likelihood score would then be reviewed every year until it has recovered to ‘once every 5 years’. The impact is just as important to look at: if a company has been affected by this outage, what has it cost the company to recover? Talk to finance and other departments to understand the cost and change the scoring accordingly.

[33:20] Why should a business carry out a risk assessment as part of lessons learnt? – Our risk assessments are not a one-off; they should be living documents that reflect the status of threats to the business. ISO 27001 includes a statement about identifying the ‘consequences of unintended changes’, and it could be argued that an outage on the level of the CrowdStrike/Microsoft outage was an unintended change that led to consequences for many businesses. So, use your risk assessments as live tools that report on the reality facing the organisation. Similarly, the BIA for BCP should be reviewed to determine whether the assumed impact reflects the real impact, and recovery plans should be checked to see whether they are effective. If a recovery plan stated that this type of incident could be recovered from in 48 hours, and in reality it took 2 weeks, then recovery times in terms of RPO and RTO should be reviewed.
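To make that review concrete, here is a minimal sketch of recalculating a traditional likelihood x impact risk score after an incident and checking the planned RTO against the actual recovery time. The 1-5 scales, function names and example figures are assumptions made for this sketch, not values taken from ISO 22301, ISO 27001 or the episode.

```python
# Illustrative sketch: rescore a risk after an incident and compare the
# planned RTO with how long recovery actually took. Scales are assumptions:
# likelihood 1 = "once every 5 years", 5 = "once every 6 months";
# impact 1 = negligible cost/disruption, 5 = severe.

def risk_score(likelihood: int, impact: int) -> int:
    """Traditional likelihood x impact scoring."""
    return likelihood * impact

def rto_review(planned_rto_hours: float, actual_recovery_hours: float) -> str:
    """Compare the BCP's assumed recovery time with what really happened."""
    if actual_recovery_hours > planned_rto_hours:
        return (f"RTO breached: planned {planned_rto_hours}h, actual "
                f"{actual_recovery_hours}h - review the BIA and recovery plan.")
    return "Recovery met the planned RTO."

# Before the outage: 'once every 5 years' likelihood, moderate impact.
before = risk_score(likelihood=1, impact=3)   # -> 3
# After the outage: likelihood raised, impact re-costed with finance.
after = risk_score(likelihood=4, impact=4)    # -> 16

print(f"Risk score before: {before}, after: {after}")
print(rto_review(planned_rto_hours=48, actual_recovery_hours=14 * 24))
```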
Remember - your management system should reflect reality and not aspiration.

If you’d like to book a demo for the isologyhub, simply contact us and we’d be happy to give you a tour.

We’d love to hear your views and comments about the ISO Show, here’s how:
● Share the ISO Show on Twitter or LinkedIn
● Leave an honest review on iTunes or Soundcloud. Your ratings and reviews really help and we read each one.

Subscribe to keep up-to-date with our latest episodes: Stitcher | Spotify | YouTube | iTunes | Soundcloud | Mailing List