The managing director had seen the demos. AI that could predict which customers would churn. AI that could recommend the next best action for sales reps. AI that could automate customer service responses.
He wanted that for his company. So he called a meeting with his IT team.
“How long until we can do this?”
The IT manager paused. “Have you looked at our customer database lately?”
The managing director hadn’t. When he did, he understood the pause.
Duplicate records everywhere. Customer names spelled three different ways. Key fields empty. Historical data in formats that nobody remembered defining. Contacts linked to companies that had merged, been acquired, or gone out of business years ago.
The AI demos worked because they ran on clean, well-structured data. His company's data was neither.
Every AI vendor will tell you their product is powerful. Few will mention that power depends entirely on what you feed it. The algorithm is the easy part. The data is hard.
Dozens of AI initiatives stall every year, not because the AI doesn't work, but because the data isn't ready.
What “AI-ready data” actually means
AI isn’t magic. It’s pattern recognition at scale. Feed it good data, and it finds useful patterns. Feed it garbage, and it finds patterns in the garbage.
For data to be AI-ready, it needs four qualities:
1. Complete
AI can’t learn from information that doesn’t exist. If half your customer records are missing industry classification, the AI can’t predict churn by industry. If historical purchase data only goes back two years, the AI can’t identify long-term buying patterns.
Completeness doesn’t mean every field in every record. It means the fields that matter for your use case are populated.
Warning
“Complete” for AI purposes is more demanding than “complete enough for humans.” A salesperson can make judgments with partial information. An AI model needs consistent data across enough records to find patterns.
A financial services firm wanted to predict which leads would convert. Their CRM had 50,000 lead records. Sounds like plenty of data.
But when they audited the fields that mattered for prediction (industry, company size, source, engagement history), only 12,000 records had all four fields populated. The rest had gaps.
They could train the model on 12,000 records (possibly too few) or include incomplete records (introducing noise). Neither option was good. The data preparation work they’d skipped was now blocking progress.
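An audit like theirs can be sketched in a few lines of Python. The records and field names below are invented stand-ins for a CRM export, not any real schema:

```python
# Toy lead records standing in for a CRM export; field names are illustrative.
leads = [
    {"industry": "Finance", "company_size": 250, "source": "Web",   "engagement_history": "3 calls"},
    {"industry": None,      "company_size": 80,  "source": "Event", "engagement_history": "1 email"},
    {"industry": "Retail",  "company_size": None, "source": "Web",  "engagement_history": "2 calls"},
    {"industry": "Finance", "company_size": 500, "source": None,    "engagement_history": "5 calls"},
]

key_fields = ["industry", "company_size", "source", "engagement_history"]

# Per-field completeness: share of records where the field is populated.
per_field = {
    f: sum(r[f] is not None for r in leads) / len(leads) for f in key_fields
}

# Records usable for training: every key field populated.
complete = [r for r in leads if all(r[f] is not None for f in key_fields)]
print(f"{len(complete)} of {len(leads)} records complete on all key fields")
```

The same few lines, pointed at a real export, give you the honest numbers before anyone commits to a model.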
2. Accurate
Wrong data is worse than missing data. Missing data is obviously absent. Wrong data looks real and leads the AI astray.
Common accuracy problems:
- Outdated information. The contact who left three years ago is still listed as the decision-maker. The company that was acquired is still shown as independent.
- Entry errors. Revenue entered as $100,000 instead of $1,000,000. Dates in the wrong format, interpreted incorrectly.
- Inconsistent standards. “Healthcare” and “Health Care” and “Medical” all meaning the same industry, but the AI treats them as three different categories.
Consider a manufacturing company whose CRM showed a customer churn rate of 5%. Excellent, if true. But when they audited, they found that “churned” customers had often just been forgotten rather than formally closed. Actual churn was closer to 18%. Any AI trained on the recorded 5% would produce dangerously wrong predictions.
3. Consistent
AI learns patterns by comparing similar things. If the same information is recorded differently in different places, the patterns blur.
Consistency issues:
- Multiple fields for the same data. Phone number in three different fields, depending on who entered it.
- Varying levels of detail. Some opportunities have detailed notes; others have none. Some contacts have full history; others are just names.
- Different standards over time. Sales stages redefined twice, with historical data spread across old and new definitions.
A professional services firm had changed their opportunity stages four times over ten years. Historical data included opportunities in stages that no longer existed. Some old records had been migrated to new stages; others hadn’t. AI analysis across this inconsistent history produced unreliable results.
4. Connected
Most useful AI use cases require combining data from multiple sources. Customer purchase history with support interactions. Marketing engagement with sales outcomes. Financial data with operational metrics.
If your data lives in disconnected silos, AI analysis is limited to what each silo can see.
Key Takeaway
The companies that get the most from AI aren’t the ones with the most data. They’re the ones whose data is connected and clean. Quantity without quality produces noise, not insight.
A retail company had rich customer data in their e-commerce platform and rich operational data in their inventory system. But the two systems didn’t share a common customer identifier. Matching records required manual lookup by email address, which worked for some customers but not others.
They wanted AI to predict inventory needs based on customer buying patterns. The AI capability existed. The data connection didn’t.
Common data problems that block AI
Knowing the theory is different from spotting the problems in your own data. Here are the issues we see most often:
Duplicate records
The same customer appears three times with slightly different information. John Smith at ABC Corp, J. Smith at ABC Corporation, and John S. at ABC.
Humans recognise these as the same person. AI models either treat them as three separate entities (fragmenting the pattern) or require matching logic that often misses edge cases.
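A minimal sketch of that matching logic, using only Python's standard library. The similarity threshold and suffix list are illustrative assumptions, and purpose-built deduplication tools are considerably more robust:

```python
import difflib
import re

def normalise(text):
    """Lowercase, strip punctuation and common company suffixes."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return re.sub(r"\b(corp|corporation|inc|ltd)\b", "", text).strip()

records = [
    ("John Smith", "ABC Corp"),
    ("J. Smith", "ABC Corporation"),
    ("John S.", "ABC"),
    ("Mary Jones", "XYZ Ltd"),
]

# Group records whose normalised name + company look alike.
groups = []
for name, company in records:
    key = normalise(name) + " " + normalise(company)
    for group in groups:
        # 0.6 is an arbitrary illustrative threshold; tune against real data.
        if difflib.SequenceMatcher(None, key, group[0]).ratio() > 0.6:
            group[1].append((name, company))
            break
    else:
        groups.append((key, [(name, company)]))
```

Even this naive version catches the three John Smith variants, but the edge cases it misses are exactly why the prevention processes below matter more than the one-off merge.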
The duplication multiplier
Mid-market companies typically have 15-30% duplicate records in their CRM. That’s not just a storage problem. Every duplicate fragments the customer’s history, making AI analysis less accurate.
Inconsistent categorisation
Industry classifications that vary by who entered them. Product categories that changed over time. Customer segments defined differently in different systems.
A technology company had “Enterprise,” “Enterprise Accounts,” “Enterprise (Large),” and “Large Enterprise” as separate customer segments in their CRM. They all meant the same thing. But any AI model would treat them as four different segments, diluting the patterns in each.
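The usual fix is a mapping table to one canonical taxonomy. A minimal sketch, reusing the segment labels from the example above (the `SMB` entry and the fail-loudly behaviour are our own assumptions):

```python
# Map legacy segment labels to one canonical taxonomy before training.
SEGMENT_MAP = {
    "Enterprise": "Enterprise",
    "Enterprise Accounts": "Enterprise",
    "Enterprise (Large)": "Enterprise",
    "Large Enterprise": "Enterprise",
    "SMB": "SMB",
}

def canonical_segment(raw):
    # Fail loudly on unknown labels so new variants get added deliberately,
    # rather than silently becoming their own category.
    if raw not in SEGMENT_MAP:
        raise ValueError(f"Unmapped segment label: {raw!r}")
    return SEGMENT_MAP[raw]

print(canonical_segment("Enterprise (Large)"))  # -> Enterprise
```

The raise-on-unknown choice is deliberate: a mapping that quietly passes unmapped labels through recreates the original problem one entry at a time.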
Missing historical data
AI models learn from the past to predict the future. If you only started tracking important data recently, there’s not enough history to find patterns.
Many companies get excited about AI sales forecasting but only have 18 months of CRM data. That’s often not enough cycles to identify seasonal patterns, let alone longer-term trends.
Unstructured text instead of structured fields
Notes, comments, and free-text descriptions contain useful information. But extracting it for AI analysis requires additional processing.
“Spoke with Sarah about renewal. She mentioned budget concerns and possible delay until Q3.” This is useful for a human reviewing the record. For AI analysis, “budget concerns” and “Q3 delay” need to be extracted into structured fields the model can use.
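A deliberately naive sketch of that extraction, assuming keyword patterns we invented for illustration. Production pipelines would use an NLP library or an LLM, but the shape is the same: free text in, structured fields out.

```python
import re

# Illustrative signal patterns; a real system would need far broader coverage.
SIGNALS = {
    "budget_concern": r"\bbudget (concerns?|constraints?|cuts?)\b",
    "delay_to_quarter": r"\b(delay|push|slip).{0,30}\bQ([1-4])\b",
}

def extract_signals(note):
    found = {}
    for field, pattern in SIGNALS.items():
        match = re.search(pattern, note, flags=re.IGNORECASE)
        found[field] = bool(match)
        if field == "delay_to_quarter" and match:
            found["delay_quarter"] = "Q" + match.group(2)
    return found

note = ("Spoke with Sarah about renewal. She mentioned budget concerns "
        "and possible delay until Q3.")
print(extract_signals(note))
```

The point is not the regex; it is that the extracted flags become fields a model can actually learn from.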
Time sync issues
Data from different systems captured at different times, or with different time zone handling, creates analysis problems.
An opportunity that shows as closed on January 31 in the CRM but doesn’t appear in revenue until February 2 in the finance system creates matching problems. Multiply this across thousands of records, and the AI’s ability to connect sales activities to outcomes breaks down.
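One common mitigation is reconciling on a tolerance window rather than exact dates. A minimal sketch, with invented records and a five-day tolerance chosen purely for illustration:

```python
from datetime import date, timedelta

# Hypothetical records: CRM close dates and finance booking dates for the
# same deals, keyed by opportunity ID.
crm_closed = {"OPP-1": date(2025, 1, 31), "OPP-2": date(2025, 2, 14)}
finance_booked = {"OPP-1": date(2025, 2, 2), "OPP-2": date(2025, 5, 1)}

TOLERANCE = timedelta(days=5)  # how much cross-system lag we accept

def reconcile(crm, finance, tolerance):
    matched, flagged = [], []
    for opp_id, closed in crm.items():
        booked = finance.get(opp_id)
        if booked is not None and abs(booked - closed) <= tolerance:
            matched.append(opp_id)
        else:
            flagged.append(opp_id)  # missing or too far apart: needs review
    return matched, flagged

matched, flagged = reconcile(crm_closed, finance_booked, TOLERANCE)
print(matched, flagged)
```

The January 31 / February 2 pair above reconciles cleanly; the record that is months apart gets flagged for a human instead of silently corrupting the training data.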
Getting your data AI-ready
This doesn’t require a massive IT project. It requires attention to the data you have.
Define your use case first
Don’t try to make all your data perfect. That’s endless and expensive. Identify the specific AI use case you want, then figure out what data it requires.
“We want to predict customer churn” defines a specific data need: customer identification, purchase history, engagement metrics, support interactions, and churn events. You can focus on those fields.
“We want to do AI stuff” defines nothing. It leads to boil-the-ocean projects that never finish.
Audit the specific data you need
For your target use case, assess the current state of relevant data:
- Completeness: What percentage of records have this field populated?
- Accuracy: Sample-check records against reality. How often is the data correct?
- Consistency: Are the same values recorded the same way across records?
- History: How far back does reliable data go?
Be honest about what you find. The point isn’t to make the data look good; it’s to understand what you’re working with.
Clean and standardise
Address the issues your audit revealed:
- Duplicates: Merge or link duplicate records. Establish processes to prevent future duplicates.
- Categorisation: Standardise to a single taxonomy. Update historical records to match.
- Missing values: Where possible, enrich from other sources. Where not possible, document the gap.
- Format inconsistencies: Normalise formats (dates, phone numbers, currencies, etc.).
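A sketch of what that normalisation looks like in practice, assuming a known set of input date formats and UK-style phone numbers. Ambiguous formats (is 03/04 March or April?) must be resolved per source, never guessed:

```python
import re
from datetime import datetime

# The accepted input formats are examples, not an exhaustive set.
DATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def normalise_date(raw):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for manual review, never guess

def normalise_phone(raw, default_country="+44"):
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("0"):  # assume national format; an assumption to verify
        return default_country + digits[1:]
    return "+" + digits

print(normalise_date("31/01/2025"))   # -> 2025-01-31
print(normalise_phone("020 7946 0958"))
```

Returning `None` for unparseable values, rather than a best guess, keeps the cleanup honest: a documented gap is recoverable, a silent guess is not.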
This is tedious work. It's also essential. Skipping it doesn't save time; it just delays the consequences.
Connect related data
Identify the connections your use case requires:
- Customer IDs that link across systems
- Time stamps that align
- Reference data that matches
Build or document these connections. Test that they work reliably.
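Where no shared customer ID exists yet, linking on a normalised email address is a common stopgap, like the retail company above. A minimal sketch with invented records; note that unmatched rows are surfaced rather than silently dropped:

```python
# Hypothetical extracts from two systems that share no common ID.
ecommerce = [
    {"email": "Jane@Example.com", "orders": 12},
    {"email": "bob@example.com", "orders": 3},
]
inventory_accounts = {
    "jane@example.com": "ACC-100",
    # bob has no inventory account: he will surface as unmatched
}

def link(ecommerce_rows, accounts):
    linked, unmatched = [], []
    for row in ecommerce_rows:
        key = row["email"].strip().lower()  # normalise before matching
        account_id = accounts.get(key)
        if account_id:
            linked.append({**row, "account_id": account_id})
        else:
            unmatched.append(row)
    return linked, unmatched

linked, unmatched = link(ecommerce, inventory_accounts)
print(f"{len(linked)} linked, {len(unmatched)} need manual review")
```

The unmatched list is the test: if it is large, the connection is not reliable enough for AI use, and a proper shared identifier is the real fix.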
Establish ongoing data governance
Clean data doesn’t stay clean without attention. Establish:
- Entry standards: How should new data be recorded?
- Validation rules: What checks prevent bad data from entering?
- Regular audits: How often is data quality reviewed?
- Ownership: Who is responsible for data quality in each area?
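Validation rules normally live inside the CRM itself (required fields, picklists), but the logic can be sketched in plain code. The field names and allowed values below are assumptions for illustration:

```python
# Minimal entry-validation sketch: reject records before they enter the CRM.
ALLOWED_INDUSTRIES = {"Healthcare", "Finance", "Retail", "Manufacturing"}
REQUIRED_FIELDS = ["name", "industry", "company_size"]

def validate(record):
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    industry = record.get("industry")
    if industry and industry not in ALLOWED_INDUSTRIES:
        errors.append(f"unknown industry: {industry!r}")
    return errors

print(validate({"name": "ABC Corp", "industry": "Health Care"}))
```

Note how "Health Care" is rejected rather than accepted as a new category: the same check that blocks a typo also blocks the taxonomy drift described earlier.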
This isn’t glamorous. It’s what separates companies that can use AI from those that keep restarting cleanup projects.
The minimum viable data set
You don’t need perfect data to start with AI. You need good-enough data for your specific use case, with a plan to improve.
Tip
A focused data set with 5,000 clean, consistent records often produces better AI results than a messy data set with 50,000 records. Quality beats quantity for most business AI applications.
For most business AI use cases, you need:
Customer master data: Unique identifiers, classification fields (industry, size, segment), key contacts. Clean and deduplicated.
Transaction history: Purchases, orders, contracts. With accurate dates, amounts, and links to customer records.
Engagement data: Interactions across touchpoints. Sales calls, support tickets, marketing responses. Timestamped and attributed.
Outcome data: What you’re trying to predict. Did the customer churn? Did the deal close? Did the support issue get resolved? Clearly defined and consistently recorded.
Consider a construction company that wanted to predict project delays. They had extensive data, but it was scattered. Project information in one system, resource allocation in another, client communication in email, issue tracking in spreadsheets.
Rather than trying to unify everything, they started with a minimum viable data set: project ID, original timeline, actual timeline, number of change requests, weather delays flagged. Five fields, pulled manually into a spreadsheet for 200 historical projects.
That wasn’t elegant. But it was enough to build a proof-of-concept model that identified change requests as the strongest delay predictor. The insight was useful. The data set was achievable.
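A first pass over a five-field data set like theirs needs nothing fancy. A sketch with invented numbers, comparing average delay for low versus high change-request projects:

```python
# Invented project records for illustration: (change_requests, delay_days).
projects = [
    (0, 2), (1, 5), (2, 12), (5, 30), (6, 41), (1, 4), (4, 25), (0, 0),
]

# Crude first look: average delay for low vs high change-request projects.
low = [d for c, d in projects if c <= 1]
high = [d for c, d in projects if c >= 2]

avg_low = sum(low) / len(low)
avg_high = sum(high) / len(high)
print(f"avg delay, <=1 change requests: {avg_low:.1f} days")
print(f"avg delay, >=2 change requests: {avg_high:.1f} days")
```

A gap like that is not a production model, but it is exactly the kind of proof-of-concept signal that justifies investing in better data.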
Different AI use cases, different data needs
Different AI applications need different data quality.
Descriptive analytics (what happened)
Data needs: Historical records with reasonable accuracy. Gaps are tolerable if representative samples exist.
Example: Analysing which marketing channels produced the most customers last year.
Quality threshold: 70-80% completeness in key fields is usually sufficient.
Predictive analytics (what will happen)
Data needs: Consistent historical data over enough time periods to establish patterns. Accuracy matters more than for descriptive analysis.
Example: Predicting which current customers are likely to churn in the next 90 days.
Quality threshold: 85-90% completeness, with consistent data definitions over at least 2-3 years of history.
Prescriptive analytics (what should we do)
Data needs: High-quality data connecting actions to outcomes. Requires clear attribution of what interventions produced what results.
Example: Recommending specific retention actions for at-risk customers based on what’s worked before.
Quality threshold: 90%+ completeness, with accurate outcome tracking and action attribution.
Generative AI (content creation)
Data needs: Less dependent on structured data quality. More dependent on having relevant content the AI can draw from.
Example: Generating personalised email responses to customer enquiries.
Quality threshold: Varies. The AI needs examples of good output and context about the business, but doesn’t require the same structured data rigour as prediction.
The role of your CRM
For most business AI applications, the CRM is the foundation. It’s where customer data lives, where interactions get tracked, where outcomes get recorded.
If your CRM data is messy, your AI initiatives will struggle. If your CRM data is clean and well-structured, AI becomes much more accessible.
This is why CRM data quality projects often aren’t about the CRM at all. They’re about enabling future AI capabilities.
The hidden AI readiness test
Here’s a quick test: could you pull a report right now showing all customers in a specific industry segment, their purchase history over the past three years, and their support ticket volume? If that takes five minutes, your data is probably AI-ready. If that takes five days (or is impossible), it’s not.
Common CRM data issues that block AI:
- Incomplete contact records: Missing industry, company size, or role information
- Outdated opportunity data: Historical opportunities without clear outcomes
- Activity tracking gaps: Periods where interactions weren’t logged
- Inconsistent custom fields: Fields that changed meaning over time
- Integration failures: Data that should sync but doesn’t
Addressing these issues for AI readiness also improves everyday CRM use. The work serves multiple purposes.
Frequently asked questions
How much data do we need for AI?
More than most people expect for training models. Less than most people fear for getting started.
For custom AI models (training your own predictive algorithms), you need thousands of records with the outcome you’re predicting. A churn prediction model needs thousands of examples of customers who churned and didn’t churn.
For pre-built AI tools (applying existing models to your data), requirements are lower. You need enough data for the tool to work with, but you’re not training from scratch.
For generative AI, the “data” is often your business knowledge, processes, and content rather than structured records.
Start by defining your use case, then assess whether you have enough data for that specific application.
How long does data preparation take?
Usually 3-6 months for a focused initiative at a mid-sized company. That includes auditing current state, cleaning historical data, establishing standards, and implementing governance.
This assumes you’re focused on specific use cases, not trying to perfect everything. A boil-the-ocean approach takes much longer and often doesn’t finish.
Quick wins are possible in weeks: standardising a key field, deduplicating a segment, establishing entry standards. The full effort takes months.
Should we fix our data before exploring AI, or can we do both?
Parallel work is possible and often practical. While data preparation progresses, you can:
- Experiment with AI tools on your cleanest data subset
- Run proof-of-concept projects that reveal specific data gaps
- Evaluate AI solutions to understand their data requirements
- Build AI literacy in your team
Just don’t expect production-quality AI results until data quality improves. Early experiments show what’s possible. Production deployment requires the foundation to be solid.
What’s the difference between data quality for reporting vs. AI?
Reporting tolerates exceptions. AI amplifies them.
A human reviewing a report can mentally adjust for a few miscategorised records or missing values. The aggregate trend is still visible, and the human applies judgment.
An AI model treats every record equally. A few hundred miscategorised records in a training set of 5,000 can skew the patterns the model learns. The AI doesn’t know which records to discount.
This is why data quality thresholds for AI are higher than for basic reporting. What’s “good enough” for a monthly pipeline review may not be good enough for an AI-driven forecast.
Do we need a data scientist, or can our existing team do this?
Data preparation doesn’t require a data scientist. It requires:
- Understanding of your business data and what it means
- Attention to detail for cleaning and standardising
- Process discipline for establishing governance
- Technical ability to work with your systems (CRM admin level, not developer level)
Many companies do initial data preparation with existing operations or CRM admin staff. Data scientists become valuable when you move to building and deploying actual AI models.
If you’re stuck at data preparation, the bottleneck usually isn’t technical skills. It’s attention, priority, and authority to enforce standards.
We tried a data cleanup project before and it didn’t stick. What’s different now?
Previous cleanups often failed because they were one-time projects without ongoing governance.
Clean data at a point in time degrades without maintenance. New records enter with inconsistent formatting. Standards get forgotten. Edge cases accumulate.
For AI readiness, the cleanup needs to be paired with sustainable governance:
- Entry standards that are enforced (validation rules, required fields)
- Regular audits that catch drift
- Clear ownership for data quality
- Consequences for not maintaining standards
The difference isn’t the cleanup. It’s whether the cleanup connects to ongoing practices that maintain quality.
The real barrier to AI
The companies that struggle with AI adoption often blame the technology. Too complex. Too expensive. We don’t have the expertise.
Usually, the real barrier is simpler: the data isn’t ready.
They want AI to predict churn, but they can’t identify who their customers are. They want AI to optimise sales, but their pipeline data is incomplete. They want AI to personalise marketing, but their contact records are duplicated and outdated.
The AI capabilities exist. The algorithms work. The tools are accessible. What’s missing is the foundation they require.
Fixing that foundation isn’t glamorous. It’s data audits, cleanup projects, governance policies. The tedious work that makes the exciting work possible.
But companies that do this work unlock capabilities their competitors can’t access. Not because they have better AI, but because they have better data. The AI just multiplies what was already there.
That managing director from the start of this article? His path is typical. It takes about six months of data work first: deduplication, standardisation, enrichment, governance.
But when companies like his finally deploy churn prediction, it works. Not perfectly, but usefully. Models identify at-risk customers with 70% accuracy, three months in advance. The customer success team has time to intervene. Retention improves.
The AI is the easy part. Getting the data ready is the actual work.
Planning an AI initiative? Get in touch and we’ll help you assess whether your data is ready.
The Cassidium team combines decades of experience in CRM implementation, revenue operations, and AI automation to help businesses build systems their teams love to use.