Generative AI Success Depends on Quality of Data

Our client IBM recently held its Think Conference, where we convened a CIO Council session to discuss hybrid cloud, AI, and security. A recurring theme throughout the IBM Think keynotes and the Council session was how important data quality and governance are to the success of AI.

Generative AI has great potential as a transformative technology that will reshape how we work, and most organizations are experimenting with proofs of concept and use cases while thinking about how to scale the technology. Before moving too fast on this major investment, however, organizations should be thinking about data quality and working to ensure that good data hygiene processes are in place. As Ritika Gunnar, General Manager, Product Management, Data & AI, IBM, emphasized at IBM Think, “Good AI begins with good data.” In our Customer Advisory Board work at Farland Group, we have heard similar perspectives on the data governance required for the safe, responsible use of generative AI.

“If you don’t have basic data hygiene, you will fail regardless of the AI projects or security tools you invest in. Organizations must understand that if their data hygiene is bad, generative AI will only multiply their challenges tenfold.”

Why is this so important? Large Language Models (LLMs) are trained on existing data, which opens the door to an array of challenges, including bias, data leakage, and inaccurate output. Depending on the use, the consequences can be dire. A student using a free, public generative AI tool to help write a paper risks being unable to cite sources and may quote inaccurate information. An enterprise using an LLM, particularly in a highly regulated industry, faces the greater risks of leaking confidential data or basing business decisions on skewed, biased output, which could result in financial loss, regulatory non-compliance, and loss of trust among its customers.

“If you don’t have strong data controls in place, your use of generative AI for any use case will not help you achieve the results you want.”

Applying data controls when evaluating use cases produces more reliable outcomes, which enables good decision-making and builds trust in this new technology. Here are a few questions to consider when looking to safely scale generative AI within your organization:

- Will your organization be using an internal or external LLM?

- Does your organization own the data used within the LLM? How confident are you in the accuracy of the datasets used to train it?

- Who, or which departments, will have access to the LLM (and ultimately the data within it)?

- What guardrails and controls can be set to help manage your enterprise’s risk appetite? (A minimal sketch follows below.)
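To make that last question more concrete, here is a minimal sketch, in Python, of one common guardrail: redacting obvious personally identifiable information (PII) from a prompt before it leaves the enterprise boundary for an external LLM. The patterns and function names here (scrub_prompt, PII_PATTERNS) are illustrative assumptions, not a reference implementation; a production control would rely on vetted PII-detection tooling and policies tuned to your regulatory context.

```python
import re

# Illustrative patterns only; a real control would use vetted PII-detection
# tooling and policies matched to the enterprise's regulatory context.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_prompt(prompt: str) -> str:
    """Redact obvious PII before a prompt is sent to an external LLM."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED {label.upper()}]", prompt)
    return prompt

if __name__ == "__main__":
    raw = "Summarize the complaint from jane.doe@example.com, SSN 123-45-6789."
    print(scrub_prompt(raw))
    # Summarize the complaint from [REDACTED EMAIL], SSN [REDACTED US_SSN].
```

Even a simple pre-prompt filter like this reflects the larger point: the controls sit in front of the model, and their effectiveness depends on the quality and governance of the data they inspect.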