October 8, 2020
Data in Times of Disruption
I think most of us would agree that 2020 is a year for the history books. We are experiencing many unusual situations: the COVID-19 pandemic, civil unrest, and fierce election battles that may affect many of us. Likewise, businesses and organizations are struggling to cope with the rapid changes triggered by these disruptions. Still, when perusing articles and advertisements for data services, business intelligence offerings, and (albeit to a lesser degree) job offers, one could get the impression that it is still “business as usual” and “more of the same.” However, as time passes, the pressure to adjust to changes will continue to increase. Let’s try to analyze what all of this means from a data point of view.
One big change is obvious: With the current restrictions on physically gathering (many) people in the same space, many employers in Administration, IT, Data and other more “hands-off” businesses have shifted from requiring their employees to work onsite all or most of the time to a 100% work-from-home model. Of course, this changes where, when, and how I access the data I need to be productive. All of the issues we discuss below apply to both onsite and remote work environments, but in the “new normal” of remote/work-from-home environments, there is an even greater urgency to tackle these questions. Fortunately, guiding principles have been developed that facilitate these discussions.
Data Governance is a set of principles, practices, and processes that help to ensure the formal management of data assets within an organization and provide for high data quality throughout the complete lifecycle of the data. It also deals with security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and overall management of the internal and external data flows within an organization. This includes the classification of the data: Which data elements are considered to be sensitive or confidential? For instance, access to Personally Identifiable Information (PII) such as names, birth dates, and social security numbers should be strictly controlled. In particular, how your data is organized and who has access to which data are critical aspects that should be (re)evaluated immediately.
How does the data need to be organized so that it can be retrieved in the most effective way both onsite and from home, yet still be well protected and secured? Further, how can the data be coordinated and/or consolidated so that there is only a “single source of truth” for each key business data element? In my recent article, Become a Master of your Data, we examined the basics of Master Data Management. Applying these principles will help prepare your data for efficient onsite and remote access. For instance, instead of giving your employees direct access to data distributed over multiple source systems, you could (re)organize your data following the Coexistence model (as described in the above-mentioned article) and provide access only to a data warehouse where the consolidated data would now reside. This could also simplify the administration of data access while still allowing appropriate employees access to permitted data. Choosing the exact data organization model that fits your business depends on many factors that are beyond the scope of this article.
While deciding how to (re)organize your data, several questions must be asked: Who has the right to access the data? Who can modify, delete, or add to it? From what locations can each data element be accessed: from onsite only, from home, or from any location? Is access provided only through a company-issued computer, can any computer be used, or is the data available through a public portal? Questions like these should encourage you to review who is responsible for determining who (or what - we should not forget that applications also need access to data even when run remotely!) can access what data, and who should implement those policies. This would be a good time to introduce the role of the Information Owner. This person typically holds a managerial position and is responsible for the confidentiality, integrity, and availability of that information. She determines the classification of data, how and by whom the information will be used, and manages risks to prevent improper access or disclosure.
Collaborating with the Information Owner, the Information Custodian implements those policies by applying appropriate safeguards. This includes locking the data so that it can be accessed only by a person or application with the proper authorization. Implementing encryption, both in the data warehouse and during transmission between the data warehouse and the user’s computer, would also fall within his realm.
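To make the questions about who, what, and from where more concrete, here is a minimal sketch in Python of how such a policy could be written down and checked. All role names, classifications, and location labels are hypothetical and chosen only for illustration; a real organization would define its own and would normally enforce them in its database or identity system rather than in a script.

```python
from dataclasses import dataclass

# Hypothetical policy table: every role, classification, and location is illustrative.
@dataclass(frozen=True)
class Rule:
    role: str                  # who: a named role, or "anyone"
    classification: str        # what: e.g. "public", "internal", "pii"
    actions: frozenset         # subset of {"read", "modify", "delete"}
    locations: frozenset       # subset of {"onsite", "home", "any"}
    managed_device_only: bool  # True = company-issued computer required

POLICY = [
    Rule("analyst", "internal", frozenset({"read"}), frozenset({"onsite", "home"}), True),
    Rule("hr_manager", "pii", frozenset({"read", "modify"}), frozenset({"onsite"}), True),
    Rule("anyone", "public", frozenset({"read"}), frozenset({"any"}), False),
]

def is_allowed(role, classification, action, location, managed_device):
    """Return True if at least one rule grants the request; deny by default."""
    for rule in POLICY:
        if rule.role not in (role, "anyone"):
            continue
        if rule.classification != classification or action not in rule.actions:
            continue
        if "any" not in rule.locations and location not in rule.locations:
            continue
        if rule.managed_device_only and not managed_device:
            continue
        return True
    return False
```

In this sketch the Information Owner would decide what goes into POLICY, and the Information Custodian would be responsible for enforcing it. Note the deny-by-default design: a request that matches no rule is refused.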
Data Transmission and Storage
With many users of the data now being offsite, special attention should be given to data transmission. When I am onsite and connected to my company’s network, I work in a secured environment; the company’s Wi-Fi would be encrypted and secured with strong passwords, and the network itself would be secured behind powerful firewalls. But when I work from home, I am connected to my home network. It is surprising that many consumers never change the default password on their home network router, thus leaving the network highly vulnerable to attacks. This might be (barely!) tolerable when I use my home network only to peruse the internet for cute cat pictures. But when I transmit confidential company information over my home network, the situation changes dramatically. It is imperative that the data transmission be as secure as possible. Creating a Virtual Private Network (VPN) that encrypts the data stream between my computer and the company network could be one solution. Or, my company could create a dedicated secure remote log-in to the company system and set the privileges such that no data can be copied out of that protected environment. Thus, I would be working virtually inside my company’s system without sitting physically in the office.
Even with good security and encryption in place for transmission and storage, some data might be too sensitive to be transmitted, stored, or displayed in full. For instance, it might be sufficient to use only the last four digits of a social security number instead of the full nine digits. Masking the first five digits of this data element, which is classified as PII, would increase data security significantly. Always ask yourself: What is the minimum amount of data I need to transmit, store, or display in order to achieve my business goals? If you don’t have strict policies in place that answer this question, you are introducing vulnerabilities.
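The masking idea can be sketched in a few lines of Python. The function name mask_ssn and the output format are my own assumptions for illustration; the important point is that masking should happen before the value is transmitted, stored, or displayed, not afterwards.

```python
import re

def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits of a nine-digit number; mask the first five."""
    digits = re.sub(r"\D", "", ssn)  # drop dashes, spaces, and other non-digits
    if len(digits) != 9:
        raise ValueError("expected exactly nine digits")
    return "***-**-" + digits[-4:]

# Obviously fake example value, used for illustration only.
masked = mask_ssn("123-45-6789")
```

Rejecting anything that does not contain exactly nine digits is a deliberate choice here: it is safer to fail loudly than to pass a malformed, unmasked value downstream.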
Also, don’t forget to consider physical security: What would happen if your employee has stored highly confidential data on his company-issued computer, and that computer gets stolen? Is the data encrypted on the laptop’s hard disk? Can your IT department erase the data remotely? Whatever method is used to secure the data transmission and storage, now – with many working from their homes – is the time to act!
Now that we have the data well organized and adequately secured, let’s examine what to do with the data. Any data that a business collects serves (or, at least, should serve!) to support business processes and decisions. Hence the data needs to be analyzed and interpreted, that is, it needs to be converted into actionable Business Intelligence (BI). BI tools help analyze and interpret raw data and visualize the results. But to what degree can end users create visualizations and reports themselves or analyze the data using algorithms? Or are those functions created centrally, so that end users see only the results and can run only some pre-defined algorithms, reports, and visualizations? In the pre-COVID world, when many co-workers sat in the same building and probably even in the same room, it was easy to maintain and support these BI functions centrally. If a change was needed, I could simply go next door, sit down with my co-worker, discuss the changes, and quickly obtain the desired result. Now, with almost everybody working remotely, an argument can be made that self-service BI could work better than the centralized model. Provided that sufficient domain expertise exists, the empowerment to create, modify, and execute one’s own BI could work better in a distributed environment. However, it is imperative that all aspects of BI are included in the Data Governance and Data Security processes discussed above.
Junk In – Junk Out
There is a big elephant in the “Data Room” that currently very few are discussing: Many data algorithms depend on a (long) history of only gradually changing data. This applies to both “simple” models such as regression-based algorithms and “advanced” models such as Machine Learning. However, since the February to April timeframe of this year, every business and every organization has experienced dramatic shifts in its operations, leading to significant data discontinuities. This poses a big problem. For instance, if your company’s business typically experiences annual cycles, and annual or multi-annual data is used to analyze the past and (try to) predict the future, your data world just fell apart. Now what?
One way to deal with these disruptions/discontinuities is to try to separate the short-term influence from long-term trends. For instance, if you run a beverage store, your past analysis might have shown that you sell more beer in summer than in winter because it is warmer and sunnier during the summer months. And, although your business took a heavy hit this spring, that same cyclic trend might still be visible in the data. Hence your fundamentals are still the same, albeit on a different level.
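Here is a minimal Python sketch of that separation. It assumes the disruption acts as a simple multiplicative shift in level, and the sales figures are made up purely for illustration. Dividing each month by the period’s average removes the level and leaves only the shape of the annual cycle, so the two periods can be compared directly.

```python
from statistics import mean

def seasonal_index(monthly_sales):
    """Divide each month by the period's average so only the shape of the cycle remains."""
    level = mean(monthly_sales)
    return [value / level for value in monthly_sales]

# Made-up numbers for illustration only: twelve months of beer sales, January to December.
before = [60, 65, 80, 100, 120, 140, 150, 145, 115, 90, 70, 65]
# Same shape at 60% of the former level, mimicking a business that took a heavy hit.
after = [value * 0.6 for value in before]

# How far the overall level moved (short-term influence).
level_change = mean(after) / mean(before)

# How far the seasonal shape moved (long-term pattern); near zero means "same fundamentals".
shape_gap = max(
    abs(a - b) for a, b in zip(seasonal_index(before), seasonal_index(after))
)
```

With real data the shape gap will not be exactly zero, and a disruption may well change the shape as well as the level. A large gap would be a signal that the old seasonal pattern can no longer be taken for granted.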
If that is not achievable, you might want to ask if you really need a long-term dataset to continue running your business successfully. Instead of crunching many months’ worth of data, can an iterative approach help you? Start analyzing a short time span, see what you can learn from that analysis, and adjust your business as necessary. Then include the newly acquired/current data in your analysis, learn, adjust…rinse and repeat! You might even find that taking a more detailed look at short-term data uncovers insights you previously missed because they were overpowered by long-term trends. But be aware of any bias that might be introduced by using a different sample in your current analysis compared with the one you used prior to the disruptions. Since you have only the data from a shorter time span available, you might not see the “full picture.” If your sales usually pick up in spring and level off mid-summer, looking only at the data from March through July might give a misleading business outlook for the rest of the year.
Many pilots follow an interesting concept from which we could benefit. When something is not quite right (or goes completely wrong), they lower the level of automation and are thus able to better handle the problem. If an engine starts running rough, the very first thing many pilots would do is to switch off the autopilot. Doing so primes their minds to thoroughly analyze every aspect of the issue and to react faster because now they are flying the plane and are not just pushing buttons.
Translated to our situation, this could mean that if a fully automated Artificial Intelligence algorithm requires training with a ton of long-term data to produce reliable results, perhaps a simple regression using the available shorter-term data might produce acceptable, reliable, and (most importantly) actionable insights. A skilled data scientist should be able to come up with a concept to deal with these data discontinuities.
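As an illustration of “lowering the level of automation,” here is a minimal Python sketch of a straight-line fit that deliberately looks only at the most recent data points. The function names, the window size, and the weekly sales figures are assumptions made up for this example; it is not a recommendation for any particular model.

```python
def fit_line(y):
    """Ordinary least-squares line through (0, y[0]), (1, y[1]), ...; returns (slope, intercept)."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def forecast_next(history, window):
    """Fit only the most recent `window` points and extrapolate one step ahead."""
    recent = history[-window:]
    slope, intercept = fit_line(recent)
    return intercept + slope * len(recent)

# Made-up weekly sales: stable around 100, then a sudden drop followed by a slow recovery.
weekly_sales = [100, 98, 101, 99, 40, 44, 48, 52]

# Use only the four weeks since the disruption; the older data describes a different world.
next_week = forecast_next(weekly_sales, window=4)
```

Rerunning the forecast each time a new data point arrives gives a simple learn-and-adjust loop. The price is the bias of a short sample: a straight line fitted to four weeks says nothing about seasonal effects later in the year.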
Change is a Chance
In any case, change is always a chance. What are the fundamentals of your business? Now is the time to evaluate what data is truly important to your organization. Who and what should access the data and from where? How is the data best analyzed?
Asking these questions and beginning to answer them will strengthen your organization in times of disruption and beyond.