The data dividend

18 April 2024

Fueling generative AI

Source: https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-data-dividend-fueling-generative-ai


The latest research estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion in annual economic benefits across 63 use cases.

Pull the thread on each of these cases, and it will lead back to data. Your data and its underlying foundations are the determining factors to what’s possible with generative AI.

That’s a sobering proposition for most chief data officers (CDOs), especially when 72 percent of leading organizations note that managing data is already one of the top challenges preventing them from scaling AI use cases. The challenge for today’s CDOs and data leaders is to focus on the changes that can enable generative AI to generate the greatest value for the business.

The landscape is still rapidly shifting, and there are few certain answers. But in our work with more than a dozen clients on large generative AI data programs, discussions with about 25 data leaders at major companies, and our own experiments in reconfiguring data to power generative AI solutions, we have identified seven actions that data leaders should consider as they move from experimentation to scale:

  1. Let value be your guide. Be clear about where the value lies and what data is necessary to deliver it.

  2. Build specific capabilities into the data architecture to support the broadest set of use cases. Build relevant capabilities (such as vector databases and data pre- and post-processing pipelines) into the existing data architecture, particularly in support of unstructured data.

  3. Focus on key points of the data life cycle to ensure high quality. Develop multiple interventions—both human and automated—into the data life cycle from source to consumption to ensure the quality of all material data, including unstructured data.

  4. Protect your sensitive data, and be ready to move quickly as regulations emerge. Focus on securing the enterprise’s proprietary data and protecting personal information while actively monitoring a fluid regulatory environment.

  5. Build up data engineering talent. Focus on finding the handful of people who are critical to implementing your data program, with a shift toward more data engineers and fewer data scientists.

  6. Use generative AI to help you manage your own data. Generative AI can accelerate existing tasks and improve them along the entire data value chain, from data engineering to data governance and data analysis.

  7. Track rigorously and intervene quickly. Invest in performance and financial measurement, and closely monitor implementations to continuously improve data performance.


1. Let value be your guide



In determining a data strategy for generative AI, CDOs might consider adapting a quote from President John F. Kennedy: “Ask not what your business can do for generative AI; ask what generative AI can do for your business.” A focus on value is a long-standing principle, but CDOs must particularly rely on it to counterbalance the pressure to “do something” with generative AI. To provide this focus on value, CDOs will need to develop a clear view of the data implications of the business’s overall approach to generative AI, which will play out across three archetypes:

  • Taker: a business that consumes preexisting services through basic interfaces such as APIs. In this case, the CDO will need to focus on making quality data available for generative AI models and subsequently validating the outputs.

  • Shaper: a business that accesses models and fine-tunes them on its own data. The CDO will need to evaluate how the business’s data management should evolve and determine the necessary changes to the data architecture to enable the desired outputs.

  • Maker: a business that builds its own foundational models. The CDO will need to develop a sophisticated data labeling and tagging strategy, as well as make more significant investments.

The CDO has the biggest role to play in supporting the Shaper approach, since the Maker approach is currently limited to only those large companies willing to make major investments and the Taker approach essentially accesses commoditized capabilities. One key function in driving the Shaper approach is communicating the trade-offs needed to deliver on specific use cases and highlighting those that are most feasible. While hyperpersonalization, for example, is a promising generative AI use case, it requires clean customer data, strong guardrails for data protection, and pipelines to access multiple data sources. The CDO should also prioritize initiatives that can provide the broadest benefits to the business, rather than simply support individual use cases.

As CDOs help shape the business’s approach to generative AI, it will be important to take a broad view on value. As promising as generative AI is, it’s just one part of the broader data portfolio (Exhibit 1). Much of the potential value to a business comes from traditional AI, business intelligence, and machine learning (ML). If CDOs find themselves spending 90 percent of their time on initiatives related to generative AI, that’s a red flag.


2. Build specific capabilities into the data architecture to support the broadest set of use cases


The big change when it comes to data is that the scope of value has gotten much bigger because of generative AI’s ability to work with unstructured data, such as chats, videos, and code. This represents a significant shift because data organizations have traditionally had capabilities to work with only structured data, such as data in tables. Capturing this value doesn’t require a rebuild of the data architecture, but the CDO who wants to move beyond the basic Taker archetype will need to focus on two clear priorities.

The first is to fix the data architecture’s foundations. While this might sound like old news, the cracks in the system a business could get away with before will become big problems with generative AI. Many of the advantages of generative AI will simply not be possible without a strong data foundation. To determine the elements of the data architecture on which to focus, the CDO is best served by identifying the fixes that provide the greatest benefit to the widest range of use cases, such as data-handling protocols for personally identifiable information (PII), since any customer-specific generative AI use case will need that capability.

The second priority is to determine which upgrades to the data architecture are needed to fulfill the requirements of high-value use cases. The key issue here is how to cost-effectively manage and scale the data and information integrations that power generative AI use cases. If they are not properly managed, there is a significant risk of overstressing the system with massive data compute activities, or of teams doing one-off integrations, which increase complexity and technical debt. These issues are further complicated by the business’s cloud profile, which means CDOs must work closely with IT leadership to determine compute, networking, and service use costs.

In general, the CDO will need to prioritize the implementation of five key components of the data architecture as part of the enterprise tech stack:

  • Unstructured data stores: Large language models (LLMs) primarily work with unstructured data for most use cases. Data leaders will need to map out all unstructured data sources and establish metadata tagging standards so models can process the data and teams can find the data they need. CDOs will need to further upgrade the quality of data pipelines and establish standards for transparency so that it’s easy to track the source of an issue to the right data source.

  • Data preprocessing: Most data will need to be prepped—for example, by converting file formats and cleansing for data quality and the handling of sensitive data—so that generative AI can use the data. Preprocessed data is most often used to build prompts for generative AI models. To speed up performance, CDOs need to standardize the handling of structured and unstructured data at scale, such as ways to access underlying systems, and prioritize (or “preaggregate”) the data that supports the most frequent questions and answers.

  • Vector databases: Vectorization is a way to prioritize content and create “embeddings” (numerical representations of text meanings) in order to streamline access to context, the complementary information generative AI needs to provide accurate answers. Vector databases allow generative AI models to access just the most relevant information. Instead of providing a thousand-page PDF, for example, a vector database provides only the most relevant pages. In many cases, companies don’t need to build vector databases to begin working with generative AI. They can often use existing NoSQL databases to start.

  • LLM integrations: More-sophisticated generative AI uses require interactions with multiple systems, which creates significant challenges in connecting LLMs. Several frameworks, many of which are open source, can help facilitate these integrations (for example, LangChain or various hyperscaler offerings, such as Semantic Kernel for Azure, Bedrock for AWS, or Vertex AI for Google Cloud). CDOs will need to set guidelines for choosing which frameworks to use, define prompt templates that can be readily customized for specific purposes, and establish standardized integration patterns for how LLMs interface with source data systems.

  • Prompt engineering: Effective prompt engineering (the process of structuring questions in a way that elicits the best response from generative AI models) relies on context. Context can be determined only from existing data and information across structured and unstructured sources. To improve output, CDOs will need to manage integration of knowledge graphs or data models and ontologies (a set of concepts in a domain that shows their properties and the relations between them) into the prompt. Since CDOs will not have ownership of many data repositories across the business, they need to set standards and prequalify sources to ensure the data that is fed into the models follows specific protocols (for example, exposing a knowledge graph API to easily provide entities and relationships).
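To make the role of a vector database concrete, the sketch below implements the retrieval step it performs: embed each content chunk, embed the query, and return only the most similar chunks for use as prompt context. The bag-of-words embedding is a deliberately simple stand-in for a real embedding model, and the chunks and query are invented examples.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    # Lowercase word tokens; punctuation is stripped.
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str, vocab: list[str]) -> list[float]:
    # Toy bag-of-words vector over a fixed vocabulary; a real system
    # would call an embedding model to produce dense semantic vectors.
    counts = [float(tokenize(text).count(word)) for word in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by cosine similarity to the query and keep the top k,
    # which is the core operation a vector database accelerates at scale.
    vocab = sorted({w for chunk in chunks for w in tokenize(chunk)})
    q = embed(query, vocab)
    scored = sorted(
        chunks,
        key=lambda c: sum(a * b for a, b in zip(q, embed(c, vocab))),
        reverse=True,
    )
    return scored[:k]


chunks = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping times vary by region and carrier.",
    "Our refund process requires the original receipt.",
]
result = top_k("What is the refund policy and do I need a receipt?", chunks, k=2)
print(result)
```

Only the two refund-related chunks are returned, so the model receives a focused context window instead of the full document set.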


3. Focus on key points of the data life cycle to ensure high quality


Data quality has always been an important issue for CDOs. But the scale and scope of data that generative AI models rely on have made the “garbage in/garbage out” truism much more consequential and expensive, as training a single LLM can cost millions of dollars. One reason pinpointing data quality issues is much more difficult in generative AI models than in classical ML models is that there is so much more data, and much of it is unstructured, making it difficult to use existing tracking tools.

CDOs need to take two actions to ensure data quality: first, they need to extend their data observability programs for generative AI applications to better detect quality issues, such as setting minimum thresholds for unstructured content to be included in generative AI applications; and second, they need to develop interventions across the data life cycle to address the issues teams discover, primarily focusing on four areas:

  1. Source data: Expand the data quality framework to include measures relevant for generative AI purposes (such as bias). Ensure high-quality metadata and labels for structured and unstructured data, and regulate access to sensitive data (for example, base access on roles).

  2. Preprocessing: Ensure data is consistent and standardized and adheres to ontologies and established data models. Detect outliers and apply normalizations. Automate PII data management and establish guidelines for determining whether data should be ignored, retained, redacted, quarantined, removed, masked, or synthesized.

  3. Prompt: Evaluate, measure, and track the quality of the prompt. Include high-quality metadata and lineage transparency for structured and unstructured data in the prompt.

  4. Output from LLM: Establish the necessary governance procedures to identify and resolve incorrect outputs, and use “human in the loop” reviews to triage output issues. Ultimately, elevate the role of individual employees by training them to critically evaluate model outputs and be aware of the quality of input data. Supplement with an automated monitoring-and-alert capability to identify rogue behaviors.
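A minimal sketch of the kind of automated quality gate described above, applied to unstructured content before it enters a generative AI pipeline. The thresholds and checks here are illustrative placeholders; real programs tune them per source and add checks for bias, duplication, and sensitive data.

```python
def passes_quality_gate(chunk: str,
                        min_chars: int = 40,
                        max_symbol_ratio: float = 0.3) -> bool:
    # Illustrative minimum thresholds for including unstructured content;
    # values are assumptions, not recommendations from the source.
    text = chunk.strip()
    if len(text) < min_chars:
        return False  # too short to carry useful context
    symbols = sum(
        1 for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-")
    )
    if symbols / len(text) > max_symbol_ratio:
        return False  # likely extraction debris rather than prose
    return True


docs = [
    "Our returns policy allows refunds within 30 days of purchase.",
    "\ufffd\ufffd\ufffd\ufffd\ufffd$$$###",  # mis-encoded extraction debris
    "ok",  # too short to be useful context
]
clean = [d for d in docs if passes_quality_gate(d)]
print(clean)
```

Gates like this sit at the preprocessing stage of the life cycle, so low-quality content is filtered out before it can degrade prompts or fine-tuning data.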


4. Protect your sensitive data, and be ready to move quickly as regulations emerge


Some 71 percent of senior IT leaders believe generative AI technology is introducing new security risks to their data. Much has been discussed about security and risk concerning generative AI, but CDOs need to contemplate the data implications in three specific areas:

  • Identify and prioritize security risks to the enterprise’s proprietary data. CDOs must evaluate the wide-ranging risks linked to disclosing the company’s data, including the risk of exposing trade secrets when sharing confidential and proprietary code with generative AI models, and prioritize the most significant threats. Many current data protection and cybersecurity protocols can be expanded to mitigate specific risks related to generative AI, for instance by incorporating pop-up reminders whenever an engineer intends to share data with a model, or by implementing automated scripts to ensure regulatory compliance.

  • Manage access to PII data. CDOs must establish regulations governing the detection and handling of data in the realm of generative AI. They should implement systems that integrate protective measures and human interventions to guarantee the removal of personally identifiable information (PII) during data preprocessing, prior to its utilization in an LLM. Using synthetic data (through data fabricators) and nonsensitive identifiers can help.

  • Track the expected surge of regulations closely. Generative AI has spurred governments to swiftly implement new regulations, like the European Union’s AI Act, which establishes various standards, including the requirement for companies to disclose summaries of copyrighted data used to train an LLM. Data leaders must stay close to the business’s risk leaders to understand new regulations and their implications for data strategy, such as the need to “untrain” models that use regulated data.
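The preprocessing-stage PII removal described above can be sketched as follows: detected identifiers are replaced with nonsensitive placeholder tokens before the text is used in an LLM prompt. The regex patterns and placeholder labels are illustrative assumptions; production systems combine dedicated PII detection services with human review rather than relying on regexes alone.

```python
import re

# Illustrative patterns only; real detectors cover many more PII types
# (names, addresses, national IDs) and use ML-based recognition.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text: str) -> str:
    # Replace each detected identifier with a nonsensitive placeholder
    # token, so downstream prompts never contain the raw value.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask_pii("Contact Jane at jane.doe@example.com or 555-867-5309.")
print(masked)
```

The same hook is a natural place to route flagged records to a human reviewer or to a synthetic-data substitute, per the guardrails described above.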


5. Build up data engineering talent


As enterprises increasingly adopt generative AI, CDOs will have to focus on the implications for talent.

Generative AI tools will manage specific coding tasks—AI is responsible for writing 41 percent of the code published on GitHub. This requires specific training on working with a generative AI “copilot”—a recent McKinsey study showed that senior engineers work more productively with a generative AI copilot than do junior engineers. Data and AI academies need to incorporate generative AI training tailored to specific expertise levels.

CDOs will also need to be clear about what skills best enable generative AI. Companies need people who can integrate data sets (such as writing APIs connecting models to data sources), sequence and chain prompts, wrangle large quantities of data, apply LLMs, and work with model parameters. This means that CDOs should focus more on finding data engineers, architects, and back-end engineers, and less on hiring data scientists, whose skills will be increasingly less critical as generative AI allows people with less advanced technical capabilities to use natural language in doing basic analysis.

In the near term, talent will remain in short supply, and we project that the talent gap will widen further, creating more incentives for CDOs to build up their training programs.


6. Use generative AI to help you manage data


Data leaders have a huge opportunity to harness generative AI to improve their own function. In our analysis, we have identified eight primary use cases across the entire data value chain where generative AI can expedite existing tasks and enhance task performance.

Many vendors are already rolling out products, requiring CDOs to identify which capabilities they can rely on vendors for and which they should build themselves. One rule of thumb is that for data governance processes that are unique to the business, it’s better to build your own tool. Note that many tools and capabilities are new and may work well in experimental environments but not at scale.


7. Track rigorously and intervene quickly


There are more unknowns than knowns in the generative AI world today, and companies are still learning their way forward. It is therefore crucial for CDOs to set up systems to actively track and manage progress on their generative AI initiatives and to understand how well data is performing in supporting the business’s goals.

In practice, leaders track progress and identify root causes of issues by using a set of core KPIs alongside operational KPIs, which measure the underlying activities that drive the core KPIs.

A core set of KPIs should include the following:

  • cost of additional components, such as vector databases and consumption of LLMs as a service

  • additional revenue enabled by integrating specific data sources with generative AI application workflows

  • time-to-market to develop a generative AI–powered application that requires access to internal data

  • end-user satisfaction with how the data has improved the performance and quality of the application

Operational KPIs should encompass tracking the most utilized data, assessing model performance, identifying areas of poor data quality, monitoring the volume of requests against specific datasets, and evaluating which use cases generate the highest activity and value.

This information is critical in providing a fact base for leadership to not just track progress but also make rapid adjustments and trade-off decisions against other initiatives in the CDO’s broader portfolio. By knowing which data sources are most used for high-value models, for example, the CDO can prioritize investments to improve data quality at those sources.

Effective investment, budgeting, and reallocation will depend on CDOs developing a FinOps-like capability to manage the entire new cost structure growing around generative AI. CDOs will need to track a new range of costs, including the number of generative AI model requests, API consumption charges from vendors (both quantity and size of calls), and compute and storage charges from cloud providers. With this information, the CDO can determine how best to optimize costs, such as routing requests by priority level or moving certain data to the cloud to cut down on networking costs.
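A minimal sketch of the FinOps-like capability described above: attributing per-request LLM costs to use cases from token counts, so the CDO can see where spend concentrates. The per-1,000-token prices are invented placeholders, not real vendor rates, and a real tracker would also capture API call sizes and cloud compute, storage, and networking charges.

```python
from dataclasses import dataclass

# Placeholder prices for illustration; real rates come from the vendor's
# price sheet and change frequently.
PRICE_PER_1K_TOKENS = {"input": 0.0005, "output": 0.0015}


@dataclass
class LLMRequest:
    use_case: str
    input_tokens: int
    output_tokens: int

    def cost(self) -> float:
        # Token-based pricing: input and output tokens are billed
        # at different per-1,000-token rates.
        return (self.input_tokens / 1000 * PRICE_PER_1K_TOKENS["input"]
                + self.output_tokens / 1000 * PRICE_PER_1K_TOKENS["output"])


def cost_by_use_case(requests: list[LLMRequest]) -> dict[str, float]:
    # Roll request-level costs up to the use case, the unit at which
    # the CDO makes trade-off and prioritization decisions.
    totals: dict[str, float] = {}
    for r in requests:
        totals[r.use_case] = totals.get(r.use_case, 0.0) + r.cost()
    return totals


log = [
    LLMRequest("support_chat", input_tokens=1200, output_tokens=300),
    LLMRequest("support_chat", input_tokens=800, output_tokens=200),
    LLMRequest("doc_search", input_tokens=4000, output_tokens=100),
]
totals = cost_by_use_case(log)
print(totals)
```

With costs rolled up this way, optimizations such as routing requests by priority level can be evaluated against the actual spend of each use case.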

The value of these metrics is only as great as the degree to which CDOs act on them. CDOs will need to establish data-performance metrics that they can review in near real time, along with protocols for making rapid decisions. Data governance programs that are already effective should remain in place but be extended to incorporate generative AI–related decisions.

Data cannot be an afterthought in generative AI. Rather, it is the core fuel that powers the ability of a business to capture value from generative AI. But businesses that want that value cannot afford CDOs who merely manage data; they need CDOs who understand how to use data to lead the business.
