Facebook Outage: a single-point-of-failure lesson for Ghana

The recent six-hour global Facebook outage was neither the first nor the longest. In 2019, Facebook suffered a 14-hour disruption that affected Facebook and Instagram users globally. Before that, there had been other outages on a smaller scale.

What is unique about the latest one, however, is that two other global social media platforms, WhatsApp and Instagram, were also affected. This is because, through acquisitions and technical convergence, Facebook has created a single point of failure, and that has become a big source of worry for businesses around the world and for industry regulators in the USA.

A day after the outage, Facebook issued a statement telling the world that the outage was caused by "a faulty configuration change".

This is exactly what the Facebook statement said:

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."

In short, Facebook has linked the data centers behind Facebook, Facebook Messenger, WhatsApp and Instagram, so when data transfer between those data centers was disrupted, all of its platforms were affected.

The scale of the outage was enormous: a whopping 3.5 billion users and businesses on Facebook's platforms were affected. Outage-tracking firm Downdetector reported over 10.6 million problem reports globally. Indeed, Facebook itself admitted that the outage affected the emails and access accounts of its own staff, making it difficult for them to even get into the system and fix the problem in time.

Facebook Founder and CEO Mark Zuckerberg reportedly lost a whopping US$6 billion as Facebook's share price plummeted as a result of the outage. Small businesses and influencers who use Facebook's platforms for their trade also reportedly lost at least US$5,000 each on average.

As stated above, this was neither the first nor the biggest outage at Facebook. Its significance lies in the fact that Facebook has now created a humongous single point of failure, which means a hitch at one point in its system will affect not just the Facebook platform, but also WhatsApp and Instagram. The latest outage is a clear example.

Convergence, at all levels, has become a popular business strategy lately. The reasons are cost cutting and effective management from a single point rather than from multiple points. But it also creates a challenge: a small problem that would previously have been localized, with limited impact, now becomes a global one.

Globally, there are almost 3 billion Facebook accounts, plus over 2.5 billion WhatsApp accounts and 1.1 billion Instagram accounts. If all these platforms had separate data centers that were managed separately, any challenge would still be big, but at least it would be localized to the specific platform. The single point of failure at Facebook means all of these over 6 billion accounts are at risk.
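The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the account figures are the rounded numbers quoted above ("almost 3 billion" is treated as 3.0), and the function name and the `converged` flag are made up for this example, not anything Facebook uses.

```python
# Approximate account counts quoted in the article, in billions.
# "Almost 3 billion" is rounded to 3.0 here (an assumption).
ACCOUNTS_BN = {"Facebook": 3.0, "WhatsApp": 2.5, "Instagram": 1.1}


def accounts_at_risk(failed_platform, converged):
    """Return the accounts (in billions) exposed when one platform's systems fail."""
    if converged:
        # Shared infrastructure: a fault anywhere reaches every platform.
        return sum(ACCOUNTS_BN.values())
    # Separate infrastructure: only the failing platform is affected.
    return ACCOUNTS_BN[failed_platform]


separate = accounts_at_risk("Instagram", converged=False)
shared = accounts_at_risk("Instagram", converged=True)
print(f"separate: {separate:.1f}bn accounts, converged: {shared:.1f}bn accounts")
```

Under these assumed figures, the same fault that would touch one platform's accounts in the separated case touches the combined total in the converged case.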

So, whereas Facebook has strategically merged its platforms for effective management and cost cutting, industry regulators, particularly in the USA, are now getting worried about the risk that a gargantuan single point of failure like the one at Facebook poses, for businesses in particular. It is estimated that some 200 million big businesses in the USA run a greater part of their business on Facebook's platforms, and such outages are a huge risk to their financial and other important data.

Another thing that has become apparent in recent times is Facebook's extreme selfishness in creating a single point of failure to exploit its billions of users. The tech giant came under global criticism for seeking to link users' private details on WhatsApp to Facebook for easy exploitation. It has also been criticized for exploiting teenagers and minors psychologically via Instagram, and it has faced several lawsuits, regulatory fines and protests across the globe for various forms of breaches.

ECG

Speaking of single points of failure, Ghana has a number of them, and they have been affecting the economy since independence, yet successive governments do not seem to care much about changing the status quo. What is happening, rather, is that government after government keeps creating more single points of failure across various sectors.

The Electricity Company of Ghana (ECG) is a classic single point of failure in Ghana. It is the only power distributor in the country. Every power generator has to go through ECG to get to the consumer. That monopoly has been one of the challenges to the country's development, yet in the wisdom of successive governments, that is the way to go, for several reasons, including the need to prevent privatization of power distribution and its cost implications for the final consumer.

So we have had to live with the impact of ECG's failings for all these decades. Once there is a small challenge at ECG, it does not matter how much power has been generated and is available for distribution; we all have to live in darkness until there is a fix at ECG. This has cost lots of businesses huge sums of money, and many homes have also suffered damage to electrical appliances without any compensation.

GhIPSS

Another single point of failure in Ghana is Ghana Interbank Payments and Settlements Systems (GhIPSS), which is the clearinghouse for all interbank transactions and, currently, digital finance interoperability transactions. GhIPSS is the only institution that sits between banks, electronic money issuers (mobile money operators) and fintechs to ensure that all cross-platform transactions are seamless.

The risk, however, is that once there is a challenge at GhIPSS, all the cross-platform transactions at the backend of cheque clearance, mobile wallet to wallet, wallet to bank, bank to wallet and others will come to a halt for as long as the problem remains. Facebook's challenge lasted for six hours, and other outages could last even longer.

ICH

The other single point of failure in Ghana is the Telecoms Interconnect Clearinghouse (ICH), which was established to replace what became popularly known as the peer-to-peer "spaghetti" interconnect arrangements between telcos. Previously, each of the then five telcos in Ghana had separate peer-to-peer interconnect infrastructure to each of the others. So, one telco had four separate connections, one to each of the other telcos, and also connections to each international gateway and to others within the ecosystem. It was "a mess", as regulators put it, even though telcos insisted that the "mess" was working effectively.

What the ICH has done is to host data centers that connect all telcos and international gateways at one point, so that their interconnect traffic flows through a single point. That way, when they go for reconciliation, the reference data is readily available at a single point, the ICH. This is a good thing, but it could also create problems. When there is a challenge at the ICH, calls from one network to another may not even get through. Again, there can be far-reaching reconciliation problems if the ICH develops a fault. Previously, such challenges would have been localized between individual telcos.
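The trade-off between the old "spaghetti" arrangement and the clearinghouse can be sketched in Python. This is a simplified model under stated assumptions: the telco names are placeholders, international gateways are left out, and the old mesh is assumed to route calls only over the direct link between two telcos. None of the function names come from the ICH or any telco.

```python
from itertools import combinations

TELCOS = ["A", "B", "C", "D", "E"]  # placeholder names for the five telcos


def mesh_links(telcos):
    """Peer-to-peer model: one direct link for every pair of telcos."""
    return {frozenset(pair) for pair in combinations(telcos, 2)}


def hub_links(telcos, hub="ICH"):
    """Clearinghouse model: every telco has a single link, to the hub."""
    return {frozenset((telco, hub)) for telco in telcos}


def pairs_cut_by_link_failure(telcos, failed_link):
    """Mesh with direct routing only: a failed link cuts just the pair it joins."""
    return [p for p in combinations(telcos, 2) if frozenset(p) == frozenset(failed_link)]


def pairs_cut_by_hub_failure(telcos):
    """Hub model: every inter-network call transits the hub, so all pairs are cut."""
    return list(combinations(telcos, 2))


print("mesh links:", len(mesh_links(TELCOS)))
print("hub links:", len(hub_links(TELCOS)))
print("pairs cut by one mesh link failing:", len(pairs_cut_by_link_failure(TELCOS, ("A", "B"))))
print("pairs cut by the hub failing:", len(pairs_cut_by_hub_failure(TELCOS)))
```

In this model the hub needs half as many links as the mesh, which is the tidiness regulators wanted, but one fault at the hub disconnects every pair of networks, whereas one faulty mesh link disconnects only the two telcos it joins.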

There may be other single points of failure in the country. But let's stay with these three.

Two points are also worth noting in the Facebook example.

First, in Facebook's own statement, they did not rule out possible internal sabotage. What this means is that even people working in an organization that runs a single point of failure could intentionally tamper with the systems for whatever purpose, and end up creating problems for an entire country or the whole world.
Second, regulators observed that even though Facebook stands out as a tech giant with all the infrastructure, tools and skilled personnel to prevent or manage such challenges, the global outage of over six hours exposed a certain weakness at Facebook.

In light of the foregoing, the single points of failure in Ghana should be on high security alert. They should invest heavily in physical security and cybersecurity to ward off saboteurs. Secondly, they need to keep updating their systems and infrastructure regularly to prevent a situation where emerging challenges or innovations outpace the systems at the single point of failure. That cannot be allowed to happen, particularly given what we have all seen happen with Facebook.

Redundancies

One sure way of curing a single point of failure is to have a number of reliable redundant systems, such that when there is a problem on the primary system, traffic or the workload can be transferred seamlessly onto the redundant infrastructure. But to the extent that the redundant equipment sits with the same organization, when something goes wrong with the switch-over, it can still lead to the kind of outage we suffered on Facebook and the ones we have been experiencing with ECG, even though both have redundant infrastructure.
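A minimal Python sketch illustrates why redundancy alone is not a cure. It is a toy model, not a description of how Facebook or ECG actually run their systems: the `Backend` and `FailoverRouter` classes and the `switch_ok` flag are hypothetical names invented for this example.

```python
class Backend:
    """A system that can serve work, such as a primary or a standby site."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"


class FailoverRouter:
    """Sends work to the primary and falls back to the standby on failure."""

    def __init__(self, primary, standby, switch_ok=True):
        self.primary = primary
        self.standby = standby
        self.switch_ok = switch_ok  # models whether the switch-over mechanism works

    def handle(self, request):
        try:
            return self.primary.handle(request)
        except RuntimeError:
            if not self.switch_ok:
                # The standby is healthy but unreachable: the switch itself
                # has become the single point of failure.
                raise RuntimeError("switch-over failed; standby unreachable")
            return self.standby.handle(request)


working = FailoverRouter(Backend("primary", healthy=False), Backend("standby"))
print(working.handle("job-1"))

broken = FailoverRouter(Backend("primary", healthy=False), Backend("standby"), switch_ok=False)
try:
    broken.handle("job-2")
except RuntimeError as err:
    print("outage:", err)
```

In the second case the standby never fails, yet the service still goes down, which is the point made above: a redundant system is only as reliable as the mechanism that switches over to it.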

It is also important for industry players and regulators to continue to work as partners to prevent unforeseen challenges. Expertise in managing this space is not only within the remit of regulators or players. There are resources across the ecosystem which can be harnessed strategically to ensure the greater good of the industry and the country.

For the ordinary Ghanaian on Facebook, WhatsApp and Instagram - this should be a lesson that life is bigger than social media. It is good to link up with friends and family on social media, and also good to reach out to a global market via social media. Indeed, Covid has taught us that we do not need a physical space to connect. But social media can also do to us what Facebook, WhatsApp and Instagram did to us recently. The simple message is: get a life outside of social media.