Getting Misleading Results by Using the Wrong Technology

The case of Public Health England's missing COVID-19 tests

There are many situations where people choose an easy and handy technology that is not fit for the purpose they are using it for, with hazardous results.

One of the most recent examples has just come into the spotlight, illustrating why you shouldn’t rely on Excel to integrate and consolidate data: thousands of Coronavirus test results in England have apparently been “lost” due to Excel’s capacity limitations.

Excel’s .xlsx format supports at most 2^14 = 16,384 columns per worksheet. The number of rows is higher, at 2^20 = 1,048,576. These limits are fixed in the format for compatibility reasons. In this case, each individual COVID-19 test result was reportedly stored column by column, implying that after 16,384 results were entered, the remainder were silently truncated because the Excel spreadsheet had reached its maximum number of columns.
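The failure mode described above can be sketched in a few lines. This is a minimal simulation, not the actual PHE process: it simply shows how storing one record per column against Excel's documented sheet limits drops everything past the cap without any error being raised.

```python
# Excel's documented .xlsx worksheet limits.
MAX_COLS = 2 ** 14   # 16,384 columns
MAX_ROWS = 2 ** 20   # 1,048,576 rows

def store_column_wise(results):
    """Simulate writing one test result per column: anything past the
    column limit is silently dropped, mirroring the reported truncation."""
    kept = results[:MAX_COLS]
    lost = results[MAX_COLS:]
    return kept, lost

results = list(range(20_000))          # hypothetical batch of test records
kept, lost = store_column_wise(results)
print(len(kept), len(lost))            # 16384 3616
```

The dangerous part is the silence: nothing in this flow signals that 3,616 records never made it into the sheet.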

Although Excel is one of the most widespread and handiest do-it-yourself tools for analysing numerical data, it is not a reliable means of ensuring there is no mishandling of data. It is also not the most performant tool for handling and collating thousands to millions of rows of data; it has no robust handling of data input exceptions, and it lacks the ability to manage data workflows and many other tasks expected of a modern enterprise data management tool. This makes Excel an unreliable technology for ensuring data quality, which is the critical foundation of trustworthy results. Excel was designed as a single-user, self-service data exploration and analysis tool, not an enterprise data management tool.
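The kind of input exception handling Excel lacks can be as simple as a reconciliation guard. The sketch below is illustrative (the function and message are invented, not from any particular tool): a pipeline step that refuses to accept a load whose record count disagrees with the source, failing loudly instead of truncating silently.

```python
def validate_load(source_count: int, loaded_records: list) -> list:
    """Refuse a load whose record count does not match the source.

    An enterprise pipeline fails loudly here; a spreadsheet would not.
    """
    if len(loaded_records) != source_count:
        raise ValueError(
            f"Expected {source_count} records but loaded "
            f"{len(loaded_records)}: aborting instead of silently dropping data."
        )
    return loaded_records
```

Had a check like this sat between the labs' submissions and the consolidated file, the truncation would have surfaced as an error on day one rather than as missing cases weeks later.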

In this particular case of COVID-19 data, with such a high degree of sensitivity and public interest, as in most data management scenarios, an automated, robust and reliable data management pipeline should have been put in place to ensure the results were trustworthy. This pipeline would extract the data from the repository storing the individual tests submitted by medical staff and consolidate it into a trusted and secure data layer, to be used as an integrated single source of truth.
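The extract-and-consolidate step described above can be sketched in a few dozen lines. This is only a toy under stated assumptions: SQLite stands in for the secure data layer, and the file contents and schema (`test_id`, `result`) are invented for illustration, not taken from any real PHE system.

```python
import csv
import io
import sqlite3

def extract(file_like):
    """Parse one lab submission (CSV) into a list of row dictionaries."""
    return list(csv.DictReader(file_like))

def load(conn, rows):
    """Append rows to the consolidated table, verifying the count grew
    by exactly the number of rows extracted before committing."""
    conn.execute("CREATE TABLE IF NOT EXISTS tests (test_id TEXT, result TEXT)")
    before = conn.execute("SELECT COUNT(*) FROM tests").fetchone()[0]
    conn.executemany("INSERT INTO tests VALUES (:test_id, :result)", rows)
    after = conn.execute("SELECT COUNT(*) FROM tests").fetchone()[0]
    if after - before != len(rows):
        raise RuntimeError("Row count mismatch: load aborted")
    conn.commit()
    return after

# A hypothetical lab submission, consolidated into the trusted store.
lab_file = io.StringIO("test_id,result\nT001,positive\nT002,negative\n")
conn = sqlite3.connect(":memory:")
total = load(conn, extract(lab_file))
print(total)  # 2
```

Unlike a spreadsheet, a relational store has no practical row ceiling at this scale, and the count check makes any loss of records an immediate, visible failure.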

There are plenty of market-leading tools offering top-notch data management capabilities, such as Informatica, Talend, Alteryx or a combination of Azure services (to name a few), to extract, integrate, cleanse, transform and load data into a consolidated, trustworthy database management system (DBMS), such as a traditional SQL Server, a cloud Azure SQL Database or Azure Synapse. This kind of combination embodies established best practices and standards, and we are confident it would have provided trustworthy results, together with security, performance, scalability, adaptability and cost-effectiveness. These are the adequate tools to use, as opposed to an ad-hoc data tool such as Excel.

The misuse of technology, whether through neglect or inadequate application, is unfortunately widespread across both the public and private sectors. There is no way around it: if your data represents some of the golden nuggets of your business, why not put a proper machine in place to extract the most value out of it? Misusing technology because the purpose of a toolset is misunderstood, because legislation is outdated, or simply because legacy software is the default preference, dismisses the importance of properly managed data. In this situation, it certainly proved not to be the right way.


Written by Ricardo Santos and Tom Walker