In this fourth and final part of the series, the focus is on virtual data modeling and on a fundamental shift in thinking that achieves the seemingly impossible: a holistic and sustainable approach to handling company data.
The Connection Between Data and Its Source
Once a single data point is copied, the connection between the information and its source is severed. What may seem like a perfectly normal procedure carries consequences, some of which can be significant. We have described these extensively in the previous articles of this series, including:
Loss of real-time data timeliness
Loss of permission management from the source system
Performance loss due to replacing powerful hardware and software with alternatives like local PCs or cost-effective data storage
Loss of automatable data retrieval
The steps that introduce these drawbacks are often inconspicuous and sometimes seem trivial: storing data in a spreadsheet, importing it into a dashboarding tool, or moving it to a shared drive, corporate cloud storage, or a data lake. In all these cases, the connection between the data and its source is lost.
"A virtual data space allows for entirely new approaches and modeling strategies."
The same consequences follow even from less inconspicuous actions. Some data platforms use additional persistence layers to optimize performance, or they decouple end users from the existing permission systems, for example by using technical users. In these cases as well, the drawbacks discussed above take effect immediately.
Preserving the connection between the data and its source not only avoids the disadvantages mentioned above but also brings additional benefits. A virtual data space allows for entirely new approaches and modeling strategies.
Approach to Virtual Data Modeling
The approach to creating virtual data models differs fundamentally from traditional data modeling. While data warehouse systems require the maximum level of detail to be determined, modeled, and loaded right from the start in order to create "the one" complete data model, virtual data modeling uses a data model optimized for each individual data access.
This is possible because a virtual data model is merely a description of how to handle the data; it does not contain the data itself. The rule of thumb for these data models is "as early as possible, as little as necessary": filter early, and load only the data that is needed. The interplay of filters and granularity is crucial. If you want to see more detail about a selection, you first filter down to that selection and only then add the detail. The full performance of the source system is used to create these data slices.
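To make the rule concrete, here is a minimal sketch in Python of what such a model description could look like. All names (VirtualSlice, the sales table, its columns) are hypothetical assumptions for illustration; the point is that the object only describes a slice and delegates the actual work to the source system.

```python
# A minimal sketch of "as early as possible, as little as necessary".
# The model holds no data: it is only a description of a data slice.
from dataclasses import dataclass, field

@dataclass
class VirtualSlice:
    source_table: str
    filters: dict = field(default_factory=dict)   # applied first, in the source
    group_by: list = field(default_factory=list)  # requested granularity
    measures: list = field(default_factory=list)  # aggregated key figures

    def drill_down(self, new_filters: dict, extra_detail: list) -> "VirtualSlice":
        """First narrow the selection, only then add detail columns."""
        return VirtualSlice(
            self.source_table,
            {**self.filters, **new_filters},
            self.group_by + extra_detail,
            self.measures,
        )

    def to_sql(self) -> str:
        """The slice is only a description; the source system does the work."""
        where = " AND ".join(f"{c} = '{v}'" for c, v in self.filters.items()) or "1=1"
        cols = ", ".join(self.group_by + [f"SUM({m}) AS {m}" for m in self.measures])
        return (f"SELECT {cols} FROM {self.source_table} "
                f"WHERE {where} GROUP BY {', '.join(self.group_by)}")

# Start coarse: revenue per year across the whole table ...
overview = VirtualSlice("sales", group_by=["year"], measures=["revenue"])
# ... then filter to one year BEFORE asking for per-product detail.
detail = overview.drill_down({"year": "2023"}, ["product"])
print(detail.to_sql())
```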
"With this approach, almost any end user's use case can be implemented."
With this approach, almost any end user's use case can be implemented: users always need either an aggregated view or only a subset of the data, since it is impossible to grasp and interpret millions or billions of records at once.
The platform providing such a virtual data space must accordingly have the following capabilities:
Connectivity to the required sources
Fast and easy modeling options for end-users
Passing filters and parameters from the virtual model down to the source system queries (see the sketch after this list)
No caching (persistence) in the virtual data space
Single sign-on for all connected sources, without technical users at any level of data access (including access from cloud applications)
Addressing the data layer where the complete business logic is defined (in data warehouse systems this is often the query level)
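As a rough illustration of the connectivity and filter-pushdown capabilities above, the following sketch shows how one and the same virtual filter could be translated into the native query form of two different sources. The connector classes and the OData-style endpoint are assumptions for illustration, not a concrete product API.

```python
# A hedged sketch: pushing the same virtual filter down to heterogeneous sources.
from abc import ABC, abstractmethod
from urllib.parse import urlencode

class SourceConnector(ABC):
    @abstractmethod
    def build_request(self, table: str, filters: dict) -> str: ...

class SqlConnector(SourceConnector):
    """e.g. a relational data warehouse."""
    def build_request(self, table: str, filters: dict) -> str:
        where = " AND ".join(f"{k} = '{v}'" for k, v in filters.items())
        return f"SELECT * FROM {table}" + (f" WHERE {where}" if where else "")

class RestConnector(SourceConnector):
    """e.g. a cloud application exposing an OData-like endpoint (assumed)."""
    def __init__(self, base_url: str):
        self.base_url = base_url
    def build_request(self, table: str, filters: dict) -> str:
        flt = " and ".join(f"{k} eq '{v}'" for k, v in filters.items())
        return f"{self.base_url}/{table}?{urlencode({'$filter': flt})}"

# The SAME virtual filter reaches each source in its native form:
filters = {"year": "2023", "region": "EMEA"}
print(SqlConnector().build_request("sales", filters))
print(RestConnector("https://erp.example.com/odata").build_request("sales", filters))
```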
Examples of Modeling Approaches in the Virtual Data Space
Comparing very large datasets poses immense challenges for companies, especially in large migration or implementation projects. Often, the same data exists both in a legacy system and in its successor. To verify that the transition logic was implemented correctly, the two system environments are typically compared.
However, because the data is stored in different repositories, often with no interfaces at all or only proprietary ones unsuited to exchanging very large datasets, the comparison can usually only be implemented with the help of additional persistence layers. Copying and moving large datasets yet again is always a technical challenge and typically very time-consuming and costly. Naturally, these additional copies reintroduce the aforementioned disadvantages of redundantly stored data.
"When the datasets are too large, performing a complete comparison becomes impractical or even impossible."
The core problem with classical data comparison is that all data must be accessible at the same time in order to detect deviations at every level of granularity. When the datasets are too large, performing a complete comparison becomes impractical or even impossible, as the computation time and/or storage requirements exceed the limits of what is technically feasible.
Virtual Modeling Operators
In the virtual data space, we always consider only the data needed at a specific moment, focusing on appropriate data slices to match the data volume to the use case. For data comparison, this might mean first comparing the data at a high level of abstraction (e.g., per calendar year), utilizing the aggregation capabilities of the sources. Only where differences appear do we filter before increasing the level of detail.
This leads to a continuous drill-down through the data that is based on a strictly limited amount of data at any given time. Since executing this process manually would be very cumbersome, and nearly impossible to repeat continuously, virtual data modeling provides its own, entirely new operators for this purpose.
The data comparison operator is one example of such a modeling operator in the virtual data space. It takes two arbitrary tables as input, requires information on which components of the tables should be compared (key figures and/or attributes), and returns only those parts of the original tables that deviate from each other according to the defined comparison. The result of this operator can then be used to filter subsequent, more granular queries.
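The following sketch illustrates the idea with pandas DataFrames standing in for the two source systems; in a real virtual data space, the aggregation would be pushed down to the sources. Column names, tolerance, and sample data are illustrative assumptions.

```python
# A sketch of the comparison operator and the drill-down it enables.
import pandas as pd

def compare(a: pd.DataFrame, b: pd.DataFrame, keys: list, measure: str,
            tol: float = 1e-9) -> pd.DataFrame:
    """Aggregate both tables to the given keys; return only deviating rows."""
    agg_a = a.groupby(keys, as_index=False)[measure].sum()
    agg_b = b.groupby(keys, as_index=False)[measure].sum()
    merged = agg_a.merge(agg_b, on=keys, how="outer",
                         suffixes=("_old", "_new")).fillna(0.0)
    diff = merged[f"{measure}_old"] - merged[f"{measure}_new"]
    return merged[diff.abs() > tol]

old = pd.DataFrame({"year": [2022, 2023, 2023], "month": [1, 1, 2],
                    "amount": [100.0, 50.0, 70.0]})
new = pd.DataFrame({"year": [2022, 2023, 2023], "month": [1, 1, 2],
                    "amount": [100.0, 50.0, 71.0]})

# Drill down: compare per year first; only for deviating years go to months.
deviating_years = compare(old, new, ["year"], "amount")
if not deviating_years.empty:
    years = deviating_years["year"].tolist()          # filter BEFORE detailing
    detail = compare(old[old.year.isin(years)],
                     new[new.year.isin(years)], ["year", "month"], "amount")
    print(detail)
```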
Data Blending in the Virtual Data Space
In the virtual data space, the question of which data belongs together takes on a new dimension. Until now, the answer was largely determined by which data happened to be available in a given system.
In the virtual data space, however, there is no longer the ONE system; there are no system or network boundaries. ESG reporting, for example, can receive data directly from supplier systems in addition to data from HR, purchasing, logistics, accounting, and production systems, along with virtual logic to determine CO₂ emissions or other complex KPIs.
The same applies to reporting that tracks supply chains and material availability: here, too, the systems of the companies in the supply chain can be used directly and in real time for evaluations.
If the impact on key performance indicators (KPIs) needs to be assessed before the purchase or sale of companies or business units, the data of the affected units can be virtually integrated into one's own reporting.
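As a hedged illustration of such blending, the sketch below combines, at query time, production figures from an internal system with emission data from a supplier system into a CO₂-per-unit KPI. The fetch functions are hypothetical placeholders for live, permission-checked source queries.

```python
# A sketch of data blending in the virtual space: nothing is persisted;
# the blend exists only for the duration of the query.
import pandas as pd

def fetch_production() -> pd.DataFrame:          # internal system (assumed)
    return pd.DataFrame({"material": ["A", "B"], "units": [1000, 400]})

def fetch_supplier_emissions() -> pd.DataFrame:  # supplier system (assumed)
    return pd.DataFrame({"material": ["A", "B"], "kg_co2": [2500.0, 1800.0]})

blend = fetch_production().merge(fetch_supplier_emissions(), on="material")
blend["kg_co2_per_unit"] = blend["kg_co2"] / blend["units"]
print(blend[["material", "kg_co2_per_unit"]])
```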
These and many other possibilities contribute to a holistic and, above all, sustainable approach to handling company data.