
Build privacy-focused AI software using vector databases


Successfully building privacy-aware AI software requires considering and classifying the data you plan to store in advance.

Translated from Building Privacy-Aware AI Software With Vector Databases, by Zachary Proser.

GenAI creates personalized web experiences by combining proprietary data with knowledge about individual users. How do we ensure that this knowledge is handled securely and in accordance with security compliance standards?

How do we guarantee users that their personally identifiable information (PII) will be deleted on request?

Let's look at the tools and patterns you can use to ensure that your app meets security and privacy standards.

Why RAG is the best architecture to ensure data privacy

Retrieval-Augmented Generation (RAG), an architecture that enriches GenAI responses with private data, is often used to address the shortcomings of large language models, including hallucinations and limited context windows.

But RAG can also help us build privacy-aware AI systems that forget specific information about an individual on demand.

In order to comply with security standards, we need to ensure that user data:

Separate: Visible only to the owner, not to other users.
Private: Not provided to LLMs through training or fine-tuning; used only at inference/generation time.
Deletable on demand: Users should be forgotten when they wish.

Separate

Namespaces decouple users' data from one another, making them well suited as a security primitive.

Private

With RAG, data is provided to the LLM only as context at generation time; it never needs to be used to train or fine-tune the AI model.

This means that user data is not stored as knowledge in the model itself; it is only shown to the GenAI model when the model is asked to generate content.

RAG enables personalization while tightly controlling any PII used to generate user-specific responses.

Proprietary data, or PII, is shared with the LLM on a per-request basis and can be quickly removed from the system, making the information unavailable in future requests.

Deletable on demand

When users wish to be forgotten, removing their data from the vector database index will cause the RAG system to no longer know about them.

Once the data is deleted, the LLM will not be able to answer questions about a given user or topic. The retrieval phase will no longer provide any such information to the LLM at the time of generation.

Compared to training or fine-tuning, RAG offers more flexibility in managing user-specific data, as you can quickly delete data for one or more entities from a production system without impacting system performance for other users.

Secure handling of customer data

Learn about the different types of data

Designing software to be privacy-aware requires an understanding of the risks associated with each type of customer data stored.

First, classify the types of data that need to be stored in the vector database. Specifically, identify which data is public, which is private, and which contains PII.


Let's say we're building an e-commerce application that will store a combination of public, private, and PII data:

Public: Company name, profile picture, and job title.
Private: API key, organization ID, purchase history.
PII: Full name, date of birth, account ID.
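One way to make this classification concrete is a small lookup at the application layer that partitions each record by sensitivity before anything reaches the vector store. The field names and the `partition_record` helper below are illustrative, not a real schema:

```python
# A minimal sketch of classifying fields before they reach the vector store.
# Field names and categories are illustrative, not a real schema.

SENSITIVITY = {
    "company_name": "public",
    "profile_picture_url": "public",
    "job_title": "public",
    "api_key": "private",
    "organization_id": "private",
    "purchase_history": "private",
    "full_name": "pii",
    "date_of_birth": "pii",
    "account_id": "pii",
}

def partition_record(record: dict) -> dict:
    """Split a record into public/private/pii buckets by field name."""
    buckets = {"public": {}, "private": {}, "pii": {}}
    for field, value in record.items():
        # Unknown fields default to the strictest class.
        level = SENSITIVITY.get(field, "pii")
        buckets[level][field] = value
    return buckets

buckets = partition_record({"company_name": "Acme", "full_name": "Ada Lovelace"})
print(buckets["pii"])  # {'full_name': 'Ada Lovelace'}
```

Defaulting unclassified fields to the strictest class means a forgotten field is over-protected rather than leaked.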

Next, determine which data will be stored as vectors only and which data must be stored in metadata to support filtering.

Our goal is to strike a balance between storing as little PII as possible and providing a rich application experience.


Filtering with metadata is very powerful, but in its simplest form it requires storing private data or PII in plain text, so be mindful of which fields you expose.

With this understanding, we can consider each data type and apply the following techniques to handle it safely.

Isolate customer data in the index

Use separate indexes for different purposes. If your application manages natural language descriptions of geolocations and some personally identifiable user data, create two separate indexes, such as location and user.

The index is named based on what it contains. Think of an index as a top-level bucket of the stored data type.

Isolate customer data in namespaces

As we wrote in an earlier post on building multitenant systems, namespaces are a convenient and secure primitive for separating organizations or users within a single index.

Think of namespaces as entity-specific partitions within an index. If the index is users, each namespace can map to an individual user, and each namespace stores only that user's data.

Using namespaces can also help improve query performance by reducing the total amount of space that needs to be searched when returning relevant results.
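A practical detail is deriving a stable, valid namespace name per user. The `namespace_for_user` helper below is a hypothetical sketch; the sanitization rules should match your vector database's naming constraints:

```python
import re

def namespace_for_user(user_id: str) -> str:
    """Derive a stable, sanitized namespace name for a user.

    Hypothetical helper: replaces any character outside [a-zA-Z0-9_-]
    so raw identifiers like email addresses become safe namespace names.
    """
    safe = re.sub(r"[^a-zA-Z0-9_-]", "-", user_id.lower())
    return f"user-{safe}"

# Every read and write for a user is then scoped to their namespace, e.g.
#   index.upsert(vectors=..., namespace=namespace_for_user("alice@example.com"))
print(namespace_for_user("alice@example.com"))  # user-alice-example-com
```

Because the mapping is deterministic, the same user always lands in the same partition, and deleting that namespace deletes everything the system stored about them.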

Use ID prefixes to query content fragments

Pinecone supports ID prefixes, a technique of appending extra data to a vector's ID field at upsert time so that you can later reference a "chunk" of content, such as all chunks in page 1, block 23, or all of user A's vectors in department Z.

ID prefixes are great for associating a set of vectors with a specific user so that you can effectively delete that user's data when they ask for it.

For example, imagine an app that processes restaurant orders so that users can find their purchase history using natural language:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # assumes a serverless index already exists
index = pc.Index("serverless-index")

index.upsert(
  vectors=[
    {"id": "order1#chunk1", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
    {"id": "order1#chunk2", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]},
    {"id": "order1#chunk3", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]},
    {"id": "order1#chunk4", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]}
  ],
  namespace="orders"
)           

The ID field can provide hierarchical labels for any combination that makes sense in your application.

This makes it easier for you to perform bulk delete and list operations:

# Iterate over all chunks from order #1
for ids in index.list(prefix='order1#', namespace='orders'):
    print(ids)           

Using an ID prefix requires some upfront planning when designing your application, but it provides a convenient way to reference all the vectors and metadata associated with a particular entity.
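The upfront planning mostly amounts to agreeing on an ID scheme and sticking to it everywhere. A hypothetical pair of helpers for building and parsing the `entity#chunkN` convention used above:

```python
def chunk_id(entity_id: str, chunk_number: int) -> str:
    """Build a hierarchical vector ID like 'order1#chunk2'."""
    return f"{entity_id}#chunk{chunk_number}"

def entity_of(vector_id: str) -> str:
    """Recover the entity a vector ID belongs to."""
    return vector_id.split("#", 1)[0]

ids = [chunk_id("order1", n) for n in range(1, 5)]
print(ids)  # ['order1#chunk1', 'order1#chunk2', 'order1#chunk3', 'order1#chunk4']
print(entity_of(ids[0]))  # order1
```

Centralizing the scheme in helpers like these keeps upserts, list operations, and deletes consistent, so a prefix query like `order1#` is guaranteed to match every chunk that belongs to that entity.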

Retrieval-augmented generation is great for deleting knowledge

Retrieval-augmented generation adds proprietary, private, or fast-changing data to LLM responses, grounding them in accurate, context-specific information.

But it's also an ideal way to provide your end users with assurance about their right to be forgotten. Let's consider an e-commerce scenario where our users can use natural language to interact with the store, retrieve old orders, buy new products, etc.

In the following RAG workflow, a user's natural language query is first converted into a query vector and then sent to a vector database to retrieve orders that match the user's parameters.


The user's personal context (their order history) and some personally identifiable information are retrieved at inference time and provided to the generative model to fulfill the request.
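The retrieval step can be sketched end to end with a stubbed embedder and an in-memory dictionary standing in for the vector database. Everything here (`embed`, `similarity`, `retrieve`, the store contents) is hypothetical scaffolding; a real system would call an embedding model and query Pinecone:

```python
# Toy sketch of RAG retrieval: embed the query, rank stored vectors by
# similarity, and hand the best matches to the LLM as context.

def embed(text: str) -> list[float]:
    # Stub: a real embedder returns a dense vector from a model.
    return [float(len(text) % 7), float(text.count("order"))]

def similarity(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# In-memory stand-in for one user's namespace: ID -> (vector, source text).
store = {
    "order1#chunk1": (embed("order for espresso beans"), "order 1: espresso beans"),
    "order2#chunk1": (embed("order for green tea"), "order 2: green tea"),
}

def retrieve(query: str, top_k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(store.items(), key=lambda item: similarity(qv, item[1][0]), reverse=True)
    return [text for _, (_, text) in ranked[:top_k]]

context = retrieve("what did I order recently?")
# The retrieved context plus the question are then sent to the LLM:
prompt = f"Context: {context}\nQuestion: what did I order recently?"
```

The key privacy property is visible in the flow: the LLM only ever sees whatever `retrieve` returns, so removing a user's vectors from the store removes them from every future prompt.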

RAG gives you control over what user data is presented to LLMs

What happens when you issue a bulk delete using the ID prefix scheme?

for ids in index.list(prefix='order1#', namespace='orders'):
  print(ids) # ['order1#chunk1', 'order1#chunk2', 'order1#chunk3', 'order1#chunk4']
  index.delete(ids=ids, namespace='orders')

You've removed all of that entity's content from the system, so subsequent retrieval queries return no results: we've effectively removed what the LLM can know about the user.


ID prefixes let us isolate, tag, and later list or delete entity-specific data. This extends RAG into an architecture that can provide guarantees about data deletion.

The safest data is the data you don't store

Tokenization to obfuscate user data

You can often avoid storing personally identifiable information in a vector database altogether. Instead, you can protect your users by storing references to other systems, such as foreign keys or row IDs pointing into a private database where the full user records live.

You can maintain a complete user record on-premises or in an encrypted and secure storage system hosted by a cloud service provider. This reduces the total number of systems that see your user data.

This process is sometimes referred to as tokenization, similar to how a language model converts the words in a prompt into token IDs from a given vocabulary. You can explore this concept using the interactive tokenization demo here.

Suppose your application maintains a lookup table or a reversible tokenization process. In that case, you can write a foreign key into the vector's metadata during upsert instead of a plaintext value that exposes the user's data.

A foreign key can be anything that makes sense to your application: a PostgreSQL row ID, an ID in a relational database where you keep user records, a URL, or an S3 bucket name that you can use to find additional data.

When upserting vectors, you can attach any metadata you wish:

index.upsert(
  vectors=[
    {
      "id": "order1#chunk1", 
      "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], 
      "metadata": {"user_id": 3046, "url": "https://example.com"}
    },
    {
      "id": "order2#chunk2", 
      "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
      "metadata": {"user_id": 201}
    }
  ]
)
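The lookup-table variant of tokenization can be sketched in a few lines. The `pii_store`, `tokenize`, and `detokenize` names are hypothetical; in production the store would be a private, encrypted database, not an in-memory dict:

```python
import secrets

# Stand-in for a private, encrypted database of full user records.
pii_store: dict[str, dict] = {}

def tokenize(user_record: dict) -> str:
    """Store the full record privately; return an opaque token for metadata."""
    token = secrets.token_hex(8)
    pii_store[token] = user_record
    return token

def detokenize(token: str) -> dict:
    """Resolve a token back to the full record (application side only)."""
    return pii_store[token]

token = tokenize({"full_name": "Ada Lovelace", "account_id": 3046})
metadata = {"user_token": token}  # safe to attach to a vector at upsert time
assert detokenize(token)["account_id"] == 3046
```

The vector database only ever sees the random token, so even a full dump of the index exposes no PII without access to the separate store.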
           

Use hashes to obfuscate user data

You can use hashes to obfuscate user data before writing it to metadata.

Hashing is not encryption. Obfuscating user data doesn't provide the same level of protection as encryption, but it does help keep PII safe from accidental leaks.

Your application provides logic to hash user PII as metadata before attaching it to its associated vector:


There are many types of hashing operations, but at a high level they transform input data into a string of characters that is meaningless on its own but could still be reversed or cracked by a determined attacker.

Your application can obfuscate user data in a number of ways, including insecure message hashes or base64 encoding, before writing values to metadata:


After the user data is hashed and stored as metadata, your application runs queries through the same hashing logic to derive the metadata filter values.
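One hedged sketch of this pattern uses a keyed HMAC rather than a bare hash, which makes offline cracking harder as long as the key stays outside the vector database. The key and field names here are illustrative:

```python
import hashlib
import hmac

SECRET_KEY = b"app-level-secret"  # keep this outside the vector database

def obfuscate(value: str) -> str:
    """Deterministically hash a PII value for use as a metadata filter."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

# At upsert time: store the hash, never the plaintext.
metadata = {"email_hash": obfuscate("ada@example.com")}

# At query time: run the value through the same logic to build the filter.
query_filter = {"email_hash": obfuscate("ada@example.com")}
assert metadata["email_hash"] == query_filter["email_hash"]
```

Because the hash is deterministic, equality filters still work exactly as they would on plaintext values, while the stored metadata reveals nothing directly.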

The vector database returns the results that are most relevant to your query, just like before.

Your application deobfuscates the user data before operating on it or returning it to the end user.


This approach provides additional defense-in-depth. Even if an attacker has access to your vector store, they still need to reverse your application-level hash to get the plaintext value.

Encrypt and decrypt metadata

Obfuscating and hashing user data is better than storing it in plain text, but not enough to defend against skilled and motivated attackers.

Encrypting metadata before each upsert, encrypting query parameters before executing queries, and decrypting the final output of each request adds significant overhead to your system. But it is the best way to ensure that your user data is secure and that your vector store knows nothing about the sensitive data in the queries it serves.

It's all about trade-offs, and you need to carefully weigh the performance penalty of continuous encryption and decryption, the overhead of maintaining security and rotating private keys, and the risk of leaking sensitive customer data.
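The encrypt-before-upsert, decrypt-after-retrieval round trip can be sketched as follows. This assumes the third-party `cryptography` package; any authenticated symmetric cipher would serve, and key management is deliberately omitted:

```python
from cryptography.fernet import Fernet

# Generate (and, in production, securely store and rotate) a key
# entirely outside the vector database.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_value(value: str) -> str:
    return cipher.encrypt(value.encode()).decode()

def decrypt_value(token: str) -> str:
    return cipher.decrypt(token.encode()).decode()

# At upsert time the vector store only ever sees ciphertext:
metadata = {"full_name": encrypt_value("Ada Lovelace")}

# After retrieval, the application decrypts before using the value:
assert decrypt_value(metadata["full_name"]) == "Ada Lovelace"
```

One trade-off worth noting: Fernet is non-deterministic, so encrypted fields cannot be used in equality filters; fields you need to filter on require deterministic encryption or the hashing approach above.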

Data retention and deletion in vector databases

If you follow the recommended multitenancy convention of maintaining a separate namespace per tenant, you can conveniently delete everything stored in a namespace with a single operation.

To remove all records from a namespace, pass the delete-all parameter appropriate to your client along with the namespace, as follows:

index.delete(delete_all=True, namespace='example-namespace')           

Successfully building privacy-aware AI software requires considering and classifying the data you plan to store in advance.

It also requires thoughtful handling of your content fragments, as we saw with ID prefixes and metadata filtering, which you can use to efficiently remove all knowledge of an entire user or organization from your system.

With a Pinecone vector database in your stack and some careful planning, you can build generative AI systems that respond to the needs of your users while respecting their privacy.
