What caused the massive global Microsoft outage?

Hello everyone and welcome back to the Cognixia podcast!

Every week, we discuss something new and interesting from emerging digital technologies, hoping to inspire our listeners to learn something new and advance in their careers. Every week we also receive all your amazing feedback and suggestions that drive us to keep going, bringing you one more interesting episode after another. So, thank you to all our awesome listeners all over the world.

In today’s episode, we talk about the massive Microsoft outage that occurred on 19 July 2024. According to the Microsoft notification, CrowdStrike, an independent endpoint cybersecurity company, released a software update that began affecting IT systems globally. Although this was not a Microsoft incident, its biggest impact was on the Microsoft ecosystem.

Endpoint protection has been a buzzword in the world of cybersecurity for quite a while now. Endpoint protection involves software running on local machines so that they do not run malicious software or any unintended code. It is like a more modern name for the good old anti-virus and firewall, sounds like it, no? It has two key components: a backend control center and agent software that is installed on the endpoint devices. And, if you haven’t guessed so far, endpoint devices are the user devices: mobile phones, laptops, desktops, etc. The endpoint protection agent software runs constantly on the endpoint devices. So, if you run a program or application that the agent decides needs to be prevented, the operating system notifies a sensor, and the sensor prevents the execution. The main endpoint application is also notified about the blocked execution, and it in turn notifies the control center over the internet.
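To make that flow a little more concrete, here is a minimal sketch of it in Python. Everything in it is an assumption made up for illustration: the class names ControlCenter and EndpointAgent, the on_execute method, and the simple blocklist. It is not how CrowdStrike or any real product is built. A real sensor lives inside the operating system, while this toy version only imitates the sequence of "check, block, report".

```python
from dataclasses import dataclass, field


@dataclass
class ControlCenter:
    """Hypothetical backend that collects reports from agents."""
    reports: list = field(default_factory=list)

    def report(self, device: str, program: str) -> None:
        # A real product would send this over the internet; here we just store it.
        self.reports.append((device, program))


@dataclass
class EndpointAgent:
    """Hypothetical agent running on a single endpoint device."""
    device: str
    control_center: ControlCenter
    blocklist: set = field(default_factory=set)

    def on_execute(self, program: str) -> bool:
        """Called by the (simulated) sensor when a program is about to start.

        Returns True if the program may run, False if it was blocked.
        """
        if program in self.blocklist:
            # Block the execution and tell the control center about it.
            self.control_center.report(self.device, program)
            return False
        return True


center = ControlCenter()
agent = EndpointAgent("laptop-01", center, blocklist={"malware.exe"})

print(agent.on_execute("notepad.exe"))  # allowed to run
print(agent.on_execute("malware.exe"))  # blocked and reported
print(center.reports)
```

The point of the sketch is only the division of labor: the agent makes the decision on the device, and the control center hears about it afterwards.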

Simply put, this is a surveillance system of sorts. To be seriously effective, it needs to be deeply embedded into the operating system. It also needs the capability and requisite permissions to bypass lots of internal security systems.

So what exactly happened that left an estimated 8.5 million systems affected? Banking services came to a halt, countless flights were canceled, travelers were stranded, retail services came grinding to a stop, and an unimaginable number of workplaces were left staring at what is popularly called “The Blue Screen of Death”. While this number is less than 1% of all Windows machines in operation globally, the broad economic and societal impact of even that small fraction is unfathomable.

This is the first time figures have been revealed for an incident of this kind, and figures of this magnitude at that. It is believed that this could be the worst cyber event in history. And, while the event is largely being labeled as a “Microsoft outage”, it was actually caused by an update rolled out by CrowdStrike, not Microsoft. The next biggest incident would be the WannaCry cyberattack of 2017, where over 300,000 devices were affected in over 150 countries. But do you see the difference between 8.5 million devices and 300,000 devices?

On 19 July at 04:09 UTC, CrowdStrike pushed out a routine update to one such ‘sensor’, which runs as a Windows device driver that hooks deep into Windows. It was one of the regular updates that are part of the ongoing protection mechanisms of the Falcon platform. To hook in this deeply, the driver needs special permissions, of course. These drivers are written in C and C++, the same languages as the Windows kernel and core libraries. The configuration update triggered a logic error, leading to a system crash and the blue screen of death, or BSOD, on impacted systems.

If this sounds too complex, don’t worry, we are going to simplify it further. When the CrowdStrike update was installed, the updated driver tried to read from a memory address that was not valid. This kind of fault is often loosely called a null pointer exception, or NPE, although in C and C++ it is more accurately an invalid memory access, which Windows reports as an access violation. It takes place when a program tries to access a memory location that does not exist or that it is not allowed to use.

Now, under normal circumstances, when an ordinary application makes an invalid memory access like this, Windows simply kills that application and protects itself. But in this case, the fault happened inside a CrowdStrike sensor which, as we explained before, is so deeply nested in Windows, and has permission to bypass so many of its built-in safety mechanisms, that it took down the entire operating system. The drivers in question, after all, are not like the drivers you would install when you buy a new printer or a new webcam. Instead, they are a lot more sophisticated and they embed themselves much deeper into the system. They need to do so to function effectively as endpoint protection.
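For the curious, the "ordinary application" half of that story can be tried out safely. The sketch below is an illustration in Python and has nothing to do with CrowdStrike’s code. It starts a separate child process that deliberately reads from address 0 using the standard ctypes module, and then shows that only the child dies. The helper name run_crashing_child is our own invention. On Linux and macOS the child is typically killed by the operating system. On Windows, ctypes may turn the fault into a Python error instead. Either way the child ends with a non-zero exit code.

```python
import subprocess
import sys

# Child program: deliberately reads from address 0, which is not a valid
# address for a normal user-mode process.
CRASHING_CHILD = "import ctypes; ctypes.string_at(0)"


def run_crashing_child() -> int:
    """Run the faulty code in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


exit_code = run_crashing_child()

# The child failed (non-zero exit code), but this parent process and the
# operating system itself carry on as normal.
print("child exit code:", exit_code)
print("parent still running")
```

The same mistake made by code running inside the kernel has no parent to fall back on, which is why the whole machine stops with a blue screen instead.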

So, wouldn’t it be better if this risky software was written in, say, Rust or Go instead of C and C++, since those languages are better able to prevent these errors? Well, the Windows APIs that are used to write this software are not ordinarily intended for third-party use. Do you know the one key thing common to antivirus software, system tweaking software, and endpoint protection software? They quite often use Windows APIs that can only be used by directly interacting with them at runtime, and that are often undocumented as well as unsupported. In practice, working at this level is done in C and C++, which is why these are the two languages used to write this software.

Doesn’t Microsoft have any mechanisms to prevent this? Well, over the years Microsoft has gone to great lengths to make its systems more secure. Right from Windows XP to Windows 10, these security measures have been ramped up significantly. It also introduced the Windows Firewall and Windows Defender ages ago. Windows Defender is Microsoft’s own alternative to CrowdStrike, built and shipped as part of Windows itself rather than added on by a third party.

Would a Linux operating system be safer than Windows in such scenarios? Not really. The entire principle that endpoint protection operates on requires it to nest or hook itself deep into the operating system. So, no matter which operating system you use, if you use endpoint protection, this is one risk you will always be carrying.

Could CrowdStrike have done something differently? Well, they could have employed the stricter safeguards that are indispensable when building, rolling out, and operating software as sensitive as endpoint protection. Also, one of the most commonly known tenets of rolling out updates, or rather, lesson one of Update Rollouts 101, is NEVER EVER ROLL OUT AN UPDATE ON FRIDAY. Like never, unless it is absolutely essential and critical. Even when you do, always have a rollback in place.
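What might "stricter safeguards" and "a rollback in place" look like? One common pattern is a staged rollout: update a tiny group of devices first, check that they are healthy, and only then move on to a bigger group. Here is a minimal sketch of that idea in Python. It is not CrowdStrike’s actual process. The function staged_rollout, the ring names, and the install, is_healthy, and rollback helpers are all hypothetical stand-ins, with a dictionary playing the part of real devices.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    rings: Sequence[List[str]],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy ring by ring; undo everything if any updated device is unhealthy.

    Returns True if every ring was updated, False if the rollout was halted.
    """
    updated: List[str] = []
    for ring in rings:
        for device in ring:
            install(device)
            updated.append(device)
        if not all(is_healthy(device) for device in ring):
            # Stop before the next, larger ring and roll back what was done.
            for device in reversed(updated):
                rollback(device)
            return False
    return True


# In-memory stand-ins: every device starts on version 1.
versions: Dict[str, int] = {
    "canary-1": 1, "office-1": 1, "office-2": 1, "world-1": 1,
}
rings = [["canary-1"], ["office-1", "office-2"], ["world-1"]]


def install(device: str) -> None:
    versions[device] = 2


def rollback(device: str) -> None:
    versions[device] = 1


def is_healthy(device: str) -> bool:
    # Simulate the new version breaking one device in the second ring.
    return device != "office-2"


ok = staged_rollout(rings, install, is_healthy, rollback)
print(ok)        # the rollout was halted
print(versions)  # every device is back on version 1
```

In this run the problem shows up in the second ring, so the last and largest group, "world-1", never receives the faulty update at all.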

What can you do to protect yourself? We’d say rely on the security and safeguards provided by whatever operating system you are using, instead of relying on third-party protections that essentially need to hook very, very deep into your operating system and require permissions to bypass the operating system’s inbuilt safeguards. Security can’t be an afterthought; it has to be built into the system, so count on that.

Microsoft jumped into action immediately and maintained transparent communication with all stakeholders, but the damage was already done, so it was disaster management in progress. They shared something that is definitely food for thought, and we would like to share it with all of you before we end the episode.

“This incident demonstrates the interconnected nature of our broad ecosystem — global cloud providers, software platforms, security vendors and other software vendors, and customers. It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist. As we’ve seen over the last two days, we learn, recover and move forward most effectively when we collaborate and work together. We appreciate the cooperation and collaboration of our entire sector, and we will continue to update with learnings and next steps,” said the statement from Microsoft.

The sheer scale of this outage and how something as simple as a new update can cause global chaos that takes days to clean up, billions of dollars of losses, and tons of issues that we can’t even begin to imagine is scary, isn’t it?

Well, we leave you with these thoughts and this is where we end this week’s episode of the Cognixia podcast. We will be back again next week with another interesting and exciting new episode.

Until then, happy learning and stay safe, both online and offline!
