Skip to main content

2 posts tagged with "rust"

View All Tags

Beyond Vector Databases: The Case for Local Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 6, 2025.
Read the Medium.com version

Cover

When “intelligence” wastes cycles

Most teams building LLM-powered products eventually realize that a large portion of their API costs come not from new insights, but from repeated questions.

A support bot, an internal assistant, or an analytics copilot, all encounter thousands of near-identical queries:

“How do I pass the API key to the local model gateway?”
“Why is the dev database connection timing out?”
“How can I refresh the cache without restarting the service?”

Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.

What do we have as a result? Burned tokens, wasted latency, and duplicated reasoning.

Vector databases solved storage, not reuse

The industry's first instinct was to throw vector databases at the problem. They excel at persistent embeddings and semantic retrieval, but they were never built for reuse. What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state. In other words, they store knowledge, not memory.

Traditional vector databases follow a key:value paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore. A semantic cache, by contrast, treats embeddings as dynamic memory — governed by similarity, expiration, and adaptive retention. Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.

With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings. This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.

In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.

From data stores to memory layers

In my previous Dev.to article, I explained how we built VCAL, a Rust-based semantic cache that sits between your app and the LLM. Instead of persisting every vector, it memorizes embeddings for a short time, indexed by semantic similarity and metadata.

When a new query arrives, VCAL compares it to cached vectors. If it is close enough — a cache hit — the LLM call is skipped, and the stored answer is returned in milliseconds. Otherwise, the request proceeds normally, and the response is stored for future matches.

The design combines concepts from vector search and traditional caching systems, enhanced with features for resilience and monitoring:

  • HNSW index for ultra-fast approximate similarity search.
  • TTL and LRU eviction for automatic cache turnover.
  • Snapshotting for persistence between restarts.
  • Prometheus metrics for observability.

All of it runs on-prem, next to your model or gateway with no remote dependencies.

Why local caching changes the economics

Unlike vector databases, a local semantic cache has one simple purpose: avoid redundant reasoning. Each avoided LLM call translates directly into saved tokens, lower API bills, and shorter response times.

In real deployments we’ve seen:

  • 30–60 % reduction in LLM calls
  • Millisecond-level latency on repeated queries enabling near-real-time responsiveness
  • Predictable resource usage: no external round-trips, no cloud egress costs, and no multi-tenant contention

At scale, the more your users interact, the greater the savings become. Instead of paying per token for every repetition, you amortize prior reasoning across sessions and teams.

And because VCAL runs inside your private environment, all caching and embeddings stay under your control ensuring data privacy, compliance, and deterministic performance even in regulated industries.

A new layer in the AI stack

If you visualize the modern LLM stack, the simplified design looks like this:

User → Application → LLM Gateway → Model

or, if a RAG (Retrieval-Augmented Generation) framework is involved:

User → Application → Retriever → Vector DB → (context) → LLM Gateway → Model

Adding a semantic cache such as VCAL introduces this new dimension:

User → Application → Retriever → Vector DB → (context) → Semantic Cache → LLM Gateway → Model

Here, the cache checks whether a semantically equivalent query was already answered. If found, the response is returned instantly — skipping tokenization, embedding, and inference altogether. If not, the request continues as usual, and the new answer is stored for future reuse.

Vector databases still matter but they belong to the knowledge layer, not the inference path. What has been missing so far is a memory layer that prevents repeated reasoning altogether. Semantic caching fills the missing “memory” slot in between. It is not a replacement for RAG, it is a complement. While RAG injects context, caching avoids duplication.

Engineering for low latency

Achieving millisecond response times in semantic caching requires more than just a fast similarity search algorithm. It’s the result of careful coordination between data structures, memory layout, and concurrency control.

The cache can be implemented efficiently in systems programming languages such as Rust, using an HNSW-based index for approximate nearest-neighbor search. HNSW provides logarithmic-scale query complexity while maintaining accuracy for large collections of embeddings, making it suitable for workloads that reach millions of cached entries.

Low latency also depends on predictable memory management and lock-free or fine-grained synchronization between threads. Instead of allocating and freeing vectors dynamically, embeddings are often stored in preallocated arenas or memory-mapped regions to minimize fragmentation and system calls. Parallel workers can update the index or evaluate similarity thresholds concurrently, so that retrieval scales with the number of available cores.

In practice, a semantic cache can be deployed as a lightweight service beside an inference gateway, communicating over local HTTP or gRPC. It can also be embedded directly into an application process when minimal overhead is required, for example, within an agent runtime or API handler.

The bigger picture

Caching has always been an invisible driver of performance — from CPU registers that reuse instructions, to CDNs that reuse content, to databases that reuse queries.

Each generation of systems extends the notion of what can be reused. As language models enter production, we are witnessing a shift toward semantic reuse: reusing meaning rather than data. This enables systems to recall previous reasoning instead of repeating it — a step toward more efficient and sustainable AI infrastructure.

In this new layer of the AI stack, semantic caching becomes a form of reasoning memory: it stores the results of understanding, not just storage operations. Instead of recomputing the same insight across thousands of near-identical prompts, we can recall it instantly — with full control over latency, privacy, and cost.

Further reading

For readers interested in implementation details and open-source examples:


Thank you for reading! Semantic caching is still an emerging concept — every real-world use case helps shape how we think about efficient reasoning. Share yours in the comments if you’d like to join the conversation.

How I Created a Semantic Cache Library for AI

· 4 min read
Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.
Read the Dev.to version

Cover

Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.


The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it to what was submitted earlier, and if it’s similar enough, return the stored answer instead of generating an LLM response, all this before asking the model.

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

  • Store and search vector embeddings in RAM using HNSW graph indexing
  • Handle TTL and LRU evictions automatically
  • Save snapshots to disk so it could restart fast

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services while deploying it on-prem or in a cloud.

Screenshot: Grafana dashboard showing cache hits and cost saving


What It Feels Like to Use

Unlike a full vector database, VCAL isn’t designed for long-term storage or analytics. I didn’t want to build another vector database.
VCAL is intentionally lightweight. It is a fast, in-memory semantic cache optimized for repeated LLM queries.

Integrating VCAL takes minutes.
Instead of calling the model directly, you send your query to VCAL first.
If a similar question has been asked before — and the similarity threshold can be tuned — VCAL returns the answer from its cache in milliseconds. If it’s a new question, VCAL asks the LLM, stores the result, and returns it.
Next time, if a semantically similar question comes in, VCAL answers instantly.

It’s like adding a memory layer between your app and the model — lightweight, explainable, and under your full control.

Flow diagram: user → VCAL → LLM


Lessons Learned

  • LLMs love redundancy. Once you start caching semantically, you realize how often people repeat the same question with different words.
  • Caching semantics ≠ caching text. Cosine similarity and vector distances matter more than exact matches.
  • Performance scales beautifully. A well-tuned cache can handle thousands of lookups per second, even on modest hardware.
  • It scales big. A single VCAL Server instance can comfortably store and serve up to 10 million cached answers in memory, depending on embedding dimensions and hardware.

What’s Next

We’re now working on a licensing server, enterprise snapshot formats, and RAG-style extensions, so teams can use VCAL not just for Q&A caching, but as the foundation for private semantic memory.

If you’re building AI agents, support desks, or knowledge assistants, you’ll likely benefit from giving your system a brain that remembers.

You can explore more at vcal-project.com - try the free 30-day Trial Evaluation of VCAL Server or jump into the open-source vcal-core version on GitHub.


Thanks for reading!
If this resonates with you, please drop a comment. I’d love to hear how you’re approaching caching and optimization for AI apps.