Looks like I’ve inadvertently taken nearly two months off from blogging. In my defense, I was building stuff with AI rather than writing about building stuff with AI.
I thought I’d get back into the posting habit with some very practical posts on making your data AI-ready. Since my last write-up focused on helper documents, which can be great examples of unstructured data, let’s talk about the right kind of structure to make unstructured data AI-ready.
What Is Unstructured Data?
Unstructured data is content that doesn’t exist in rows and columns; think Word documents, images, audio recordings, blog posts, emails, message board conversations, transcripts of meetings. This is, of course, an astounding amount of information.
When you want an AI to work with large collections of unstructured data, the content generally needs some sort of processing. One key technique for processing and searching this content is retrieval-augmented generation (RAG), and in most RAG implementations, documents are first broken into chunks during indexing.
Chunk, You Say?
No, I’m not referring to your favorite Goonie. Chunking is the process of breaking content down into smaller segments (i.e., chunks) that can be stored, indexed, and cross-referenced.
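As a rough sketch, here’s what a minimal fixed-size chunker might look like. Real indexers split on tokens, sentences, or document structure rather than raw characters, and the `chunk_size` and `overlap` values here are illustrative, not recommendations:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size, overlapping character chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

book = "Texans ride horses. The Caddo traveled in dugout canoes. " * 20
chunks = chunk_text(book)
# Each chunk repeats the last 50 characters of the previous one, so a
# sentence that straddles a boundary still appears whole in some chunk.
```

The overlap is a common trick: it trades a little storage for not losing meaning at chunk boundaries.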
Example: your eighth grader is doing a school project on transportation. She finds five books on the subject and feeds all of them to her favorite AI tool. Then, for a quick query test, she asks for the best way to travel from Fort Worth to San Antonio.
The entire body of knowledge the AI ingested includes that Texans ride horses, snowshoes are a thing, and the Caddo tribe traveled in dugout canoes. The AI is also aware of cars, airplanes, Amtrak, and cross-country buses.
Vector Indexing
As these subjects are broken down into chunks, they’re indexed with vectors and stored in a vector database. A vector is a numerical representation of meaning, and vectors are stored in such a way that RAG queries can determine how close each chunk is to the question being asked.
Since the vector describes context and meaning, a chunk being “close” to the question indicates that the information in this chunk is likely relevant to the answer.
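To make “closeness” concrete, here’s a sketch of cosine similarity, a common distance measure in vector databases. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, but the math is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: nearer 1.0 means nearer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-D stand-ins for real embeddings.
question = [0.9, 0.1, 0.3]  # "best way from Fort Worth to San Antonio"
amtrak = [0.8, 0.2, 0.4]    # city pairs, long-distance consumer travel
canoe = [0.1, 0.9, 0.2]     # historical travel, water conveyance

print(cosine_similarity(question, amtrak))  # ~0.98: close to the question
print(cosine_similarity(question, canoe))   # ~0.27: not close
```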
The Role of RAG
When our student asks for the best way to travel from Fort Worth to San Antonio, the AI converts that question into a vector. A RAG query then identifies chunks that are closest to that question vector, and the AI uses the information in the chunks when generating an answer.
In this example, the question vector probably includes concepts such as “modern transportation,” “point-to-point intercity,” “highway travel,” and “Texas geography.”
The chunk regarding dugout canoes more likely has context such as “historical travel” and “water conveyance.” Likewise, the snowshoe chunk is more likely to focus on “recreational sport,” “winter wilderness,” and “mountaineering.”
Mathematically, neither of those vectors is very close to the question vector. However, the vector for Amtrak chunks probably includes “city pairs,” “long distance travel,” and “consumer travel.” This vector is a lot closer to the question vector, so the AI is likely to use Amtrak chunks in answering the question.
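A toy version of that retrieval step might look like the following. The chunk texts and vectors are invented for illustration; a real system would get both from an embedding model and query a vector database instead of sorting a Python list:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: nearer 1.0 means nearer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(question_vec, indexed_chunks, k=2):
    """Rank stored chunks by similarity to the question vector; return the top k."""
    ranked = sorted(indexed_chunks,
                    key=lambda item: cosine_similarity(question_vec, item["vector"]),
                    reverse=True)
    return ranked[:k]

# Invented vectors standing in for real embeddings.
index = [
    {"text": "Amtrak connects major Texas cities.",        "vector": [0.8, 0.2, 0.4]},
    {"text": "The Caddo traveled rivers by dugout canoe.", "vector": [0.1, 0.9, 0.2]},
    {"text": "Snowshoes spread weight over deep snow.",    "vector": [0.2, 0.1, 0.9]},
]
question = [0.9, 0.1, 0.3]  # "best way from Fort Worth to San Antonio"

top = retrieve(question, index, k=1)
print(top[0]["text"])  # the Amtrak chunk wins the ranking
```

The retrieved chunks are then handed to the model as context when it generates the answer.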
Why This Matters
Chunking is an art, and it’s one you want your AI to perform well. A chunk that’s too small might be missing relevant data, while a chunk that’s too large might dilute key information with surrounding noise. Chunk size is largely dictated by configuration and requires deliberate tuning.
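Here’s a quick sketch of the too-small case. With a naive character-based splitter and a chunk size that’s too tight, a single fact gets severed across two chunks, and neither fragment’s vector captures the whole idea (the sentence and sizes are illustrative):

```python
def chunk_text(text, chunk_size):
    """Naive character-based splitting, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

fact = "An intercity train runs between Fort Worth and San Antonio every day."

too_small = chunk_text(fact, chunk_size=40)
# The key phrase is split: "...between Fort Wor" / "th and San Antonio..."
# Neither fragment's vector would capture the full route.

large_enough = chunk_text(fact, chunk_size=200)
# The whole fact lands in one chunk, producing one coherent vector.
```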
Likewise, a little bit of structure goes a long way. A massive stream-of-consciousness dump with no structure at all leads to a poor chunking strategy and low-quality vectors. When a vector misrepresents a chunk’s meaning, that chunk can appear relevant to the question when it really isn’t.
Let’s say you ask me, rather than your AI, how to get from Fort Worth to San Antonio. My brain chunks through all my knowledge and says:
- I saw a circus performance in Fort Worth once.
- There were elephants.
- Elephants were once used for transportation.
- “Elephants” are relevant to both “transportation” and “Fort Worth.”
Clearly, one should travel across Texas via elephant.
Apparently, when my brain chunked all this information, it created poor chunks with a lot of noise, which resulted in vectors that appeared relevant but weren’t. In turn, those irrelevant chunks were used during answer generation. You’ve probably seen this problem manifest as hallucinations or false reasoning by an AI.
So, what can you do to improve chunking effectiveness? The answer has a lot to do with giving your unstructured data a bit of structure before it reaches your AI. I’ll talk about that in my next post. From there, I’ll go into metadata management, the key area where a little effort can have a significant impact on making your data AI-ready.
