Analyst predictions 2022: The future of data management

In the 2010s, organizations became keenly aware that data would become the critical ingredient in driving competitive advantage, differentiation and growth. But to this day, putting data to work remains a difficult challenge for many if not most organizations.

As the cloud matures, it has become a game changer for data practitioners by making cheap storage and massive processing power readily accessible. We've also seen better tooling in the form of data workflows, streaming, machine learning and artificial intelligence, developer tools, security, observability, automation, new databases and the like. These innovations accelerate data proficiency but at the same time add complexity for practitioners. Data lakes, data hubs, data warehouses, data marts, data fabrics, data meshes, data catalogs and data oceans are forming, evolving and exploding onto the scene.

In an effort to bring perspective to this sea of optionality, we’ve brought together some of the brightest minds in the data analyst community to discuss how data management is morphing and what practitioners should expect in 2022 and beyond.

In this Breaking Analysis, we’ll review the predictions from six of the best analysts in data and data management who will present and discuss their top predictions and trends for 2022 and the first half of this decade.

These experienced analysts include: Sanjeev Mohan, former Gartner analyst and principal at SanjMo; Tony Baer of dbInsight; Carl Olofson, research vice president with IDC; Dave Menninger, senior vice president and research director at Ventana Research; Brad Shimmin, chief analyst for AI platforms, analytics and data management at Omdia; and Doug Henschen, vice president and principal analyst at Constellation Research.

I believe that data governance is now not only going to be mainstream, it's going to be table stakes. And for all the things that you mentioned, the data oceans, data lakes, lakehouses, data fabrics and meshes, the common glue is metadata. If we don't understand what data we have and how we are governing it, there is no way we can manage it. So we saw Informatica go public last year after a hiatus of six years. I'm predicting that this year we see some more companies go public. My bet is on Collibra, most likely, and maybe Alation.

I'm also predicting that the scope of data governance is going to expand beyond just data. It's not just data and reports. We are going to see more transformations like Spark, Python, even Airflow. We're going to see more streaming data, the Kafka Schema Registry, for example. We will see AI models become part of this whole governance suite.

The governance suite is going to be very comprehensive with detailed lineage, impact analysis, and then even expand into data quality. We’ve already seen that happen with some of the tools where companies are buying smaller firms and bringing in data quality monitoring and integrating it with metadata management, data catalogs, also data access governance.
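To make the suite Mohan describes concrete, here is a minimal, hypothetical sketch in Python of a catalog entry that carries lineage, impact analysis and quality hooks alongside ordinary metadata; the field names and structure are purely illustrative, not drawn from any particular governance product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record: metadata plus lineage and quality."""
    name: str                                       # e.g. "sales.orders"
    owner: str                                      # responsible domain team
    upstream: list = field(default_factory=list)    # lineage: what this asset derives from
    downstream: list = field(default_factory=list)  # what depends on this asset
    quality_checks: dict = field(default_factory=dict)  # check name -> last result

    def impact_of_change(self):
        # Impact analysis in miniature: everything downstream is affected.
        return list(self.downstream)

orders = CatalogEntry(
    "sales.orders", owner="sales-domain",
    upstream=["raw.order_events"],
    downstream=["bi.revenue_dashboard", "ml.churn_features"],
    quality_checks={"row_count_nonzero": True},
)
print(orders.impact_of_change())
```

In a real platform, entries like this would be harvested automatically from pipelines and queries rather than declared by hand; the point is only that lineage and quality become first-class fields of the metadata record.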

So what we are going to see is that once data governance platforms become the key entry point into these modern architectures, I'm predicting that usage, the number of users, of a data catalog is going to exceed that of a BI tool. That will take time, but we already see that trajectory.

We opened up the prediction for comments and the following were noteworthy:

Doug Henschen, while generally agreeing with Sanjeev on the importance of governance, believes we’re still a ways off from mainstream. His opinion is that too few organizations practice good governance because it’s difficult and incentives are lacking. He did point out that ESG – environmental, social and governance – mandates may be a catalyst, as it has been in financial regulation. This will require tighter governance, but his feeling is we still have far to go before mainstream adoption.

Brad Shimmin added that he’d love to believe that data catalogs would be the answer, but to date, they’ve become metadata silos for specific domain use cases, such as cybersecurity or data quality. And penalties for noncompliance, such as fines, are often less expensive than fixing governance. But with new public policy emerging, we may see more strict guidelines put in place that will accelerate this prediction.

The idea of data mesh was first proposed by ThoughtWorks a couple of years ago, and the press has been almost uniformly uncritical. A good reason for that is all the problems that Sanjeev, Doug and Brad were just speaking about: we have all this data out there and we don't know what to do about it. Now, that's not a new problem. It was a problem we had with enterprise data warehouses; it was a problem we had with Hadoop clusters. It's even more of a problem now that data is out in the cloud, where the data is not only in your data lake, it's all over the place. And that includes streaming, which I know we'll be talking about later. So the data mesh was a response to this problem. Basically, data mesh is an architectural pattern and a process.

My prediction for this year is that data mesh is going to hit cold, hard reality. Data mesh is seen as a very revolutionary new idea. I don't think it's that revolutionary, because we've talked about ideas like this before. Brad, you and I met years ago when we were talking about SOA and decentralizing all of this, but that was at the application level. Now we're talking about it at the data level. And now we have microservices. So there's this thought: if we're deconstructing apps into cloud-native microservices, why don't we think of data in the same way? My sense this year is that enterprises are going to look at this seriously. And as they look at it seriously, it's going to attract its first real hard scrutiny; it's going to attract its first backlash.

That's not necessarily a bad thing. It means that it's being taken seriously. The reason I think you'll start to see the cold, hard light of day shine on data mesh is that it's still a work in progress. This idea is basically a couple of years old, and there are still some pretty major gaps. The biggest gap is in the area of federated governance. Now, federated governance itself is not a new issue. With federated governance, we started figuring out how to strike the balance between consistent enterprise policy and governance on the one hand, and putting data in the hands of the groups that understand the data on the other. How do we balance the two?

There's a huge gap there in practice and knowledge. Also, to a lesser extent, there's a technology gap, which is basically in the self-service technologies that will help teams govern data through the full lifecycle: selecting the data, building the pipelines, determining access control, and looking at quality and at whether the data is fresh or trending off course.

So my prediction is that it will receive its first harsh scrutiny this year. You are going to see some organizations and enterprises declare premature victory when they build some federated query implementations. You're going to see vendors start to “data mesh-wash” their products, be it a pipelining tool, ELT [extract, load, transform], a catalog or a federated query tool. Vendors will be promoting how they support data mesh. Hopefully nobody's going to call themselves a data mesh tool, because data mesh is not a technology.

We're going to see one other thing come out of this, and it harks back to the metadata that Sanjeev was talking about and the data catalog. There's going to be a new focus on metadata, and I think that's going to spur interest in data fabrics. Now, data fabrics are pretty vaguely defined, but if we just take the most elemental definition, which is a common metadata backplane, I think that if anybody is going to get serious about data mesh, they need to look at the data fabric, because at the end of the day we all need to read from the same sheet of music.

Generally the group was mixed on this topic.

Dave Menninger said we need to better define these overlapping terms we’ve been discussing, such as data mesh, data fabric and data virtualization. Menninger shared some survey data from Ventana on data virtualization, saying 79% of organizations using virtualized access to their data lakes were satisfied. Only 39% of those organizations not using virtualized access to their data lakes were satisfied.

Sanjeev Mohan has a different perspective. He said data mesh has already been defined along its four principles: domain ownership, data as a product, self-serve data platform and federated computational governance. He proposes taking the discussion to another level. He also stressed that data mesh is a business concept, whereas data fabric is a data integration pattern. His point is that the two are really not comparable.

To that end, Mohan believes we need to take data mesh down to the level of understanding, for example, what a data product looks like, how to handle shared data across domains and how to handle governance. He believes we're going to see more operationalization of data mesh in 2022 — perhaps in the manner we've reported with JPMorgan Chase and HelloFresh.

I regard graph databases as the next truly revolutionary database management technology. I'm looking at the graph database market, which we haven't defined yet, so I have a little wiggle room here. But this market will grow by about 600% over the next 10 years. Now, 10 years is a long time, but over the next five years we expect to see gradual growth as people start to learn how to use it. The problem is not that it's not useful; it's that people don't know how to use it. So let me explain, before I go any further, what a graph database is.

A graph database organizes data according to a mathematical structure called a graph. The graph has elements called nodes and edges. A data element drops into a node, and the nodes are connected by edges; the edges connect one node to another. Combinations of edges create structures that you can analyze to determine how things are related. In some cases, the nodes and edges can have properties attached to them, which add additional informative material that makes them richer. That's called a property graph.
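A tiny sketch in plain Python can make the node-and-edge vocabulary Olofson describes concrete. This toy class is purely illustrative, a stand-in for what a real graph database does at scale; the node names and edge labels are invented:

```python
# Minimal property graph: both nodes and edges carry property dictionaries.
class PropertyGraph:
    def __init__(self):
        self.nodes = {}    # node id -> properties
        self.edges = []    # (src, dst, label, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, dst, label, **props):
        self.edges.append((src, dst, label, props))

    def neighbors(self, node_id, label=None):
        """Follow edges out of a node, optionally filtered by edge label."""
        return [dst for src, dst, lbl, _ in self.edges
                if src == node_id and (label is None or lbl == label)]

g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("acct-1", kind="account", opened="2021-06-01")
g.add_edge("alice", "acct-1", "OWNS", since="2021")   # edge with a property
print(g.neighbors("alice", "OWNS"))                   # ['acct-1']
```

Analysis in a real graph database amounts to traversals like `neighbors` chained together, with the engine, rather than hand-written code, doing the work of following edges efficiently.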

There are two principal kinds of graph databases. There are semantic graphs, which are used to break down human language text into semantic structures. Then you can search it, organize it and answer complicated questions. A lot of AI is aimed at semantic graphs.

Another kind is the property graph that I just mentioned, which has a dazzling number of use cases.

I want to just point out, as I talk about this, people are probably wondering: we have relational databases, isn't that good enough? Relational databases support what I call definitional relationships. That means you define the relationships in a fixed structure. The data drops into that structure; there's a foreign key value that relates one table to another, and that value is fixed. You don't change it. If you change it, the database becomes unstable and it's not clear what you're looking at. In a graph database, the system is designed to handle change so that it can reflect the true state of the things it's being used to track.

So let me just give you some examples of use cases for this. They include entity resolution, data lineage, social media analysis, customer 360, fraud prevention, cybersecurity and supply chain, which is a big one. There is explainable AI, and this is going to become important because a lot of people are adopting AI. But they want a system, after the fact, to explain how the AI came to that conclusion. How did it make that recommendation? Right now we don't have really good ways of tracking that. There's also machine learning in general.

And then we’ve got data governance, data compliance, risk management. We’ve got recommendation, we’ve got personalization, anti-money-laundering, that’s another big one, identity and access management. Network and IT operations is already becoming a key one where you actually have mapped out your operation, whatever it is, your data center, and you can track what’s going on as things happen there. There’s also root cause analysis, and fraud detection is a huge one.

A number of major credit card companies use graph databases for fraud detection, risk analysis, tracking and tracing, churn analysis, next best action, what-if analysis, impact analysis and entity resolution. I would add a few other things to this list. Metadata management is important. I was in metadata management for quite a while in my past life, and one of the things I found was that none of the data management technologies available to us could efficiently handle metadata because of the kinds of structures that result from it. But graphs can. Graphs can do things like say, this term in this context means this, but in that context, it means that.

And also, because it handles recursive relationships (by recursive relationships, I mean objects that own other objects of the same type), you can do things like bills of materials, for example, a parts explosion. Or you can do an HR analysis: who reports to whom, how many levels up the chain and that kind of thing. You can do that with relational databases, but it takes a lot of programming. In fact, you can do almost any of these things with relational databases, but the problem is, you have to program it. It's not supported in the database. And whenever you have to program something, you can't trace it, you can't define it, you can't publish it in terms of its functionality, and it's really, really hard to maintain over time.
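Olofson's point about recursion taking programming in the relational world can be seen in the "who reports to whom" example: it works, but only through an explicitly written recursive query. A sketch using Python's built-in sqlite3, with the table and employee names invented for illustration:

```python
import sqlite3

# A self-referencing employees table: each row points at its manager's id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Dana',  NULL),   -- top of the chain
        (2, 'Sam',   1),
        (3, 'Alex',  2),
        (4, 'Robin', 3);
""")

# How many levels up the chain is Robin? A recursive CTE walks manager links.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, level) AS (
        SELECT id, name, 0 FROM employees WHERE name = 'Robin'
        UNION ALL
        SELECT e.id, e.name, c.level + 1
        FROM employees e
        JOIN chain c ON e.id = (SELECT manager_id FROM employees WHERE id = c.id)
    )
    SELECT name, level FROM chain ORDER BY level
""").fetchall()
print(rows)  # [('Robin', 0), ('Alex', 1), ('Sam', 2), ('Dana', 3)]
```

The traversal logic lives in the query you wrote, not in the database's model of the data. In a graph database the reporting edges are the data, so the same question is a native traversal rather than a hand-built recursion.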

According to Omdia’s Brad Shimmin, graph databases have already disrupted the market. He points out that most banks are using graph databases to get fraud detection under control. And he says it’s the best and perhaps only way to truly solve many of the problems Carl mentioned. Shimmin says the Achilles heel of graph databases is they’re tied to very specialized and unique use cases.

Further, according to Shimmin, graph databases are technologically completely different. You can't just stand up SQL and query them, for example. This makes scaling an issue, especially for a property graph, because of its uniqueness, specialized metadata, complexity and data volumes. Olofson adds that because of this complexity, a single server can't handle the problem, so the scope spans networks, which introduces latency.

Sanjeev Mohan adds that according to DB-Engines, as of January 2022 there are 381 databases on its ranked list. The largest category is RDBMS. The second-largest category is actually divided into two: property graphs and RDF graphs. These two together make up the second-largest number of databases. So the other big problem is there are so many graph databases out there from which to choose.

I like to say that historical databases are going to become a thing of the past. By that I don't mean that they're going to go away; that's not my point. I mean, we need historical databases, but streaming data is going to become the default way in which we operate with data. So in the next, say, three to five years, I would expect that data platforms — and we're using the term data platforms to represent the evolution of databases and data lakes — will incorporate these streaming capabilities. We're going to process data as it streams into an organization, and then it's going to roll off into historical databases.

Historical databases don’t go away, but they become a thing of the past. They store the data that occurred previously. And as data is occurring, we’re going to be processing it, we’re going to be analyzing it, we’re going to be acting on it. I mean, we only ever ended up with historical databases because we were limited by the technology that was available to us.

Data doesn't occur in batches, but we processed it in batches because that was the best we could do. And it wasn't bad, and we've continued to improve. But streaming data today is still the exception, not the rule. There are projects within organizations that deal with streaming data, but it's not the default way in which we deal with data yet.
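The shift Menninger describes, acting on each event as it occurs rather than waiting for a batch, can be sketched with a simple sliding-window computation; the sensor feed and window size here are invented for illustration:

```python
from collections import deque

def streaming_average(events, window_size=3):
    """Yield a running average as each event arrives.

    Only the most recent `window_size` events stay "hot" in the window;
    older events roll off, the way they would roll off into a historical store.
    """
    window = deque(maxlen=window_size)
    for value in events:          # act on each event at arrival time
        window.append(value)
        yield sum(window) / len(window)

sensor_feed = [10, 12, 11, 50, 13]   # arrives one reading at a time
for avg in streaming_average(sensor_feed):
    print(round(avg, 2))             # a result per event, not per batch
```

The batch version of the same computation would wait for all five readings and emit one number; the streaming version produces an answer, and could trigger an action, the moment the anomalous reading of 50 shows up.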

And so my prediction is that this is going to change: we're going to have streaming data be the default way in which we deal with data, however you label it and whatever you call it. Maybe these databases and data platforms just evolve to be able to handle it. But we're going to deal with data in a different way. And our research shows that already: about half of the participants in our analytics and data benchmark research are using streaming data, and another third are planning to use streaming technologies. So that gets us to about eight out of 10 organizations that need to use this technology.

That doesn’t mean they have to use it throughout the whole organization, but it’s pretty widespread in its use today and has continued to grow. If you think about the consumerization of IT, we’ve all been conditioned to expect immediate access to information, immediate responsiveness. We want to know if an item is on the shelf at our local retail store and we can go in and pick it up right now. That’s the world we live in and that’s spilling over into the enterprise IT world. We have to provide those same types of capabilities.

So that’s my prediction: Historical databases become a thing of the past, streaming data becomes the default way in which we operate with data.

As Carl Olofson points out, all databases store history. He doesn’t expect that processing historical data will go away. We’re still going to have to do payroll and accounting and file tax returns. But in terms of the leading use cases, increasingly streaming will become more mainstream. Traditional methods and streaming will complement each other.

Tony Baer doesn’t see streaming becoming a default soon but he does see a convergence among streaming, transaction databases and analytic data platforms. He posits that the use cases are demanding these real-time capabilities and cloud-native architectures allow us to converge technically. For example, you can have a node doing real-time processing and at the same time predictive analytics, correlated with other customer data.

The consensus from the group is that streaming will become more important and a larger piece of the value equation. It will take some time before it’s truly the default model. Database types are converging and there’s a spectrum emerging where you have historical batch, near-real-time with low latency and real-time streaming to support new use cases such as AI inferencing at the edge.

I think that we've been seeing automation play within AI for some time now. And it's helped us do a lot of things, especially for practitioners that are building AI outcomes in the enterprise. It's helped them to fill skills gaps, it's helped them to speed development and it's helped them to actually make AI better. In some ways it provides some swim lanes; for example, technologies like AutoML can auto-document and create that transparency that we talked about a little bit earlier.

But there's an interesting kind of convergence happening with this idea of automation. As automation started happening for practitioners, it has been trying to move outside of its traditional bounds, things like selecting my features, picking the right algorithm, building the right model. It's expanding across the full lifecycle of building an AI outcome: starting at the very beginning with the data, and continuing on to the end, which is the continuous delivery and continuous automation of that outcome to make sure it's right and it hasn't drifted and so on.
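At its core, the automation Shimmin describes is a search over candidate models scored on held-out data. A deliberately toy sketch in plain Python, where the "models" are trivial stand-ins rather than anything a real AutoML system would use:

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def auto_select(train, holdout, candidates):
    """Fit every candidate, score each on the holdout, return the best name."""
    xs, ys = zip(*train)
    hx, hy = zip(*holdout)
    def mse(model):
        return sum((model(x) - y) ** 2 for x, y in zip(hx, hy)) / len(hx)
    fitted = {name: fit(xs, ys) for name, fit in candidates.items()}
    return min(fitted, key=lambda name: mse(fitted[name]))

train = [(x, 2 * x + 1) for x in range(10)]        # synthetic linear data
holdout = [(x, 2 * x + 1) for x in range(10, 14)]
best = auto_select(train, holdout, {"mean": fit_mean, "linear": fit_linear})
print(best)  # the linear model wins on this linear dataset
```

Real AutoML systems add feature engineering, hyperparameter search and the lifecycle monitoring Shimmin mentions, but the fit-score-select loop above is the kernel being automated.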

And because of that, because it's become very powerful, we're starting to actually see this weird thing happen where the practitioners are starting to converge with the users. That is to say, for example, if I'm in Tableau right now, I can stand up Salesforce Einstein Discovery, and it will automatically create a nice predictive algorithm for me given the data that I pull in. But what's starting to happen, and we're seeing this from the companies that create business software, such as Salesforce, Oracle, SAP and others, is that they're starting to use these same ideas and a lot of deep learning to stand up these out-of-the-box, flip-a-switch solutions: you've got an AI outcome at the ready for business users.

And I think that's the way it's going to go, and what it means is that AI is slowly disappearing. I don't think that's a bad thing. If anything, what we're going to see in 2022 and maybe into 2023 is this sort of rush to put this idea of disappearing AI into practice and have as many of these solutions in the enterprise as possible. You can see, for example, SAP is going to roll out this quarter something called adaptive recommendation services, which is basically a cold-start AI outcome that can work across a whole bunch of different vertical markets and use cases. It's just a recommendation engine for whatever you need to do in the line of business. So basically, you're an SAP user, you turn on your software one day, you're a sales professional, let's say, and suddenly you have a recommendation for customer churn.

Boom! That’s great. Well, I don’t know, I think that’s terrifying. In some ways I think it is the future that AI is going to disappear like that, but I’m absolutely terrified of it because I think that what it really does is it calls attention to a lot of the issues that we already see around AI, specific to this idea of what we at Omdia like to call “responsible AI.”

How do you build an AI outcome that is free of bias, that is inclusive, that is fair, that is safe, that is secure, that is auditable, et cetera? So if you imagine a Salesforce customer, let's say, turning on Einstein Discovery within their sales software, you need some guidance to make sure that when you flip that switch, the outcome you're going to get is correct.
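One concrete example of the kind of guidance Shimmin is calling for is a simple fairness check such as demographic parity: do the model's approval rates differ materially across groups? A hypothetical sketch, with the groups, outcomes and tolerance all invented for illustration:

```python
def approval_rate(decisions):
    """Fraction of positive (approved) decisions."""
    return sum(decisions) / len(decisions)

def demographic_parity_gap(decisions_by_group):
    """Largest difference in approval rate between any two groups."""
    rates = [approval_rate(d) for d in decisions_by_group.values()]
    return max(rates) - min(rates)

# 1 = approved, 0 = declined, split by a protected attribute
outcomes = {
    "group_a": [1, 1, 0, 1, 1, 0, 1, 1],   # 75% approved
    "group_b": [1, 0, 0, 1, 0, 0, 1, 0],   # 37.5% approved
}
gap = demographic_parity_gap(outcomes)
print(f"parity gap: {gap:.3f}")
if gap > 0.1:                              # illustrative tolerance, not a standard
    print("flag: approval rates differ materially across groups")
```

Demographic parity is only one of several competing fairness definitions, which is exactly the standards gap noted below: without agreed guidelines, each company must decide which checks to run and what gap is acceptable before flipping the switch.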

And that's going to take some work. And so, I think we're going to see this move to roll this out, and suddenly there's going to be a lot of problems, a lot of pushback. Some of that's going to come from GDPR and the other regulations that Sanjeev was mentioning earlier. A lot of it is going to come from internal CSR requirements within companies that are saying, "Hey, whoa, hold up, we can't do this all at once. Let's take the slow route, let's make AI automated in a smart way."

And that’s going to take time.

Shimmin also described a lack of standards that can serve as guidelines for companies to better understand AI. This is especially important for companies that don't have an internal data science team with the knowledge to verify, when AI is embedded into a process or workflow, that the system is actually going to behave properly.

Olofson further pointed out that AI presents some tricky problems. In particular, humans are biased, and the data feeding AI systems often carries inherent biases. So when it comes to moral and legal issues, we need to be especially careful and not simply let the machines decide.

My prediction is that lakehouse and this idea of a combined data warehouse and data lake platform is going to emerge as the dominant data management offering. I say offering. That doesn’t mean it’s going to be the dominant thing that organizations adopt, but it’s going to be the predominant vendor offering in 2022.

Heading into 2021, we already had Cloudera, Databricks, Microsoft and Snowflake as proponents. SAP, Oracle and several of the fabric, virtualization and mesh vendors joined the bandwagon. The promise is that you have one platform that manages your structured, unstructured and semistructured information, and it addresses both the BI analytics needs and the data science needs.

The real promise there is simplicity and lower cost. But I think end users have to answer a few questions.

The first is: Does your organization really have a center of data gravity, or is the data highly distributed across multiple data warehouses, multiple data lakes, on-premises and cloud? If it's very distributed, and consolidating would be difficult and isn't really a goal for you, then maybe that single platform is unrealistic and not likely to add value. In that highly distributed situation, the fabric and virtualization vendors and the mesh idea might be a better path forward.

The second question, if you are looking at one of these lakehouse offerings and at consolidating, simplifying and bringing everything together on a single platform, is the following: You have to make sure that it meets both the warehouse need and the data lake need. Vendors like Databricks and Microsoft, with Azure Synapse, are really new to the data warehouse space, and they're having to prove that the data warehouse capabilities on their platforms can meet the scaling requirements, the user and query concurrency requirements, and those tight service-level agreements.

And then on the other hand, you have Oracle, SAP and Snowflake, the data warehouse folks, coming into the data science world, and they have to prove that they can manage unstructured information and meet the needs of data scientists. I'm seeing a lot of the lakehouse offerings from the warehouse crowd managing that unstructured information in columns and rows. And some of these vendors, Snowflake in particular, are really relying on partners for the data science needs.

So you really have to look at a lakehouse offering and make sure that it meets both the warehouse and the data lake requirement.

Courtesy: https://siliconangle.com/2022/01/09/analyst-predictions-2022-future-data-management/
