
3 posts tagged with "artificial-intelligence"


What Actually Matters When You Try to Reduce LLM Costs

· 7 min read
Founder of VCAL Project

Originally published on Medium.com on April 5, 2026.

After publishing the first release of AI Cost Firewall, I thought the hard part was done.

The idea was simple and it worked immediately: avoid sending duplicate or semantically similar requests to the LLM, and you reduce cost.

I described that initial approach in more detail here: How to Reduce OpenAI API Costs with Semantic Caching

And it did work.

But once I started pushing it further — adding more metrics, handling edge cases, running real traffic through it — it became clear that the initial idea was only a small part of the problem.

Reducing LLM cost is not just about caching. It’s about understanding where the cost actually comes from, what “savings” really mean, and what begins to break when a system moves from a controlled demo into something closer to production.


The First Insight Still Holds

The original observation hasn’t changed, and neither has the core architecture. The system still solves the same underlying problem.

  • Users repeat themselves.
  • Applications repeat themselves.
  • Agents repeat themselves.

Often the wording changes slightly, but the intent remains the same. From the model’s perspective, however, every variation is a brand new request. And every request has a cost.

So yes — caching works. It reduces cost immediately, often without any changes to the application itself.

But that’s only the surface. The deeper questions only appear once you try to rely on it.


The First Misconception: “Caching Is Free”

In the beginning, the results looked almost too good.

In a demo environment, exact cache hits dominated. When a request hit the cache, it meant no API call, no tokens, and almost zero latency. It felt like pure gain, as if cost reduction came with no trade-offs.

That illusion disappears the moment you introduce semantic caching properly. Because semantic caching requires embeddings.

To determine whether two requests are similar, you first need to convert them into vectors. That means calling an embedding model, storing the result, and comparing it against existing data. Only then can you decide whether to reuse a response or forward the request to the LLM.
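That decision step can be sketched in a few lines. This is a toy illustration: real systems call an embedding model and query a vector index, rather than computing cosine similarity by hand over a list, and the threshold value here is arbitrary.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_lookup(query_vec, cache, threshold=0.9):
    # cache: list of (embedding, response) pairs already stored.
    # Return the cached response of the closest entry above the
    # similarity threshold, or None to signal "forward to the LLM".
    best_score, best_response = 0.0, None
    for emb, response in cache:
        score = cosine_similarity(query_vec, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for real embedding-model output.
cache = [([1.0, 0.0, 0.2], "Answer about Jira access")]
print(semantic_lookup([0.9, 0.1, 0.25], cache))  # close enough: cache hit
print(semantic_lookup([0.0, 1.0, 0.0], cache))   # unrelated: returns None
```

Every call to `semantic_lookup` presupposes that the query was already embedded, which is exactly where the hidden cost comes in.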

And embeddings are not free.

At that point, the equation changes:

Net savings = avoided LLM cost − embedding cost

This is where things become more delicate.

If your similarity threshold is too low, you generate embeddings too often. If your traffic is highly unique, most of those embeddings never lead to a cache hit. If your embedding model is expensive, the optimization starts working against you.
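A back-of-envelope model makes the trade-off concrete. All request counts, hit rates, and prices below are hypothetical placeholders, not measured numbers from the project:

```python
def net_savings(requests, hit_rate, llm_cost_per_call, embed_cost_per_call):
    # Net savings = avoided LLM cost - embedding cost.
    # Assumes every request is embedded once; only hits avoid an LLM call.
    avoided = requests * hit_rate * llm_cost_per_call
    embedding = requests * embed_cost_per_call
    return avoided - embedding

# Hypothetical prices: $0.01 per LLM call, $0.0001 per embedding call.
print(net_savings(10_000, 0.30, 0.01, 0.0001))   # healthy hit rate: positive
print(net_savings(10_000, 0.005, 0.01, 0.0001))  # mostly unique traffic: negative
```

The second case is the failure mode described above: when almost no embedding leads to a hit, the embedding bill outweighs the avoided calls and the "optimization" loses money.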

What initially looked like a simple cost reduction mechanism becomes something that requires careful balance.

That was the moment when the project stopped being just a clever shortcut and started behaving like a system that needs tuning.


Where the Savings Actually Come From

Looking at real metrics changed another assumption.

Intuitively, semantic caching feels like the main feature. It’s the “intelligent” part of the system. But in practice, most of the savings come from something much simpler — exact matches.

A surprisingly large portion of traffic is not just similar — it is identical. The same prompt appears again and again, sometimes minutes apart, sometimes hours later. Once you see it in real data, it’s hard to ignore.

Semantic caching still matters, but its role is different. It extends the coverage rather than forming the base.

Without exact caching, the system loses most of its immediate impact. Without semantic caching, you miss additional opportunities. But they are not equal contributors, and treating them as such leads to wrong expectations.
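A minimal sketch of that two-tier lookup, with an in-memory dict standing in for Redis and a brute-force scan standing in for Qdrant. The `toy_embed` function is purely illustrative; a real deployment would call an embedding model:

```python
import hashlib
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    n = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / n if n else 0.0

class TieredCache:
    """Exact-match tier first (Redis in the real system), then a
    semantic tier (Qdrant). Embeddings are only computed on an
    exact miss, so identical prompts cost nothing extra."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: prompt -> vector
        self.threshold = threshold
        self.exact = {}             # sha256(prompt) -> response
        self.vectors = []           # (embedding, response) pairs

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt):
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        vec = self.embed(prompt)
        for emb, resp in self.vectors:
            if _cos(vec, emb) >= self.threshold:
                return resp, "semantic"
        return None, "miss"

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.vectors.append((self.embed(prompt), response))

# Toy embedding: bag of characters, just to make the sketch runnable.
def toy_embed(text):
    t = text.lower()
    return [t.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

cache = TieredCache(toy_embed, threshold=0.95)
cache.put("How do I get access to Jira?", "Ask your admin for a Jira license.")
print(cache.get("How do I get access to Jira?")[1])   # exact hit, no embedding
print(cache.get("how do I get access to Jira ?")[1])  # rephrased: semantic hit
```

The ordering encodes the observation above: the cheap exact tier absorbs the bulk of the traffic, and the semantic tier only runs for what falls through.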


What Breaks in Real Usage

As soon as real traffic enters the system, subtle issues begin to surface.

One of the first is payload size.

LLM requests tend to grow over time. Prompts accumulate context, system messages expand, conversation history becomes longer. In some cases, payloads become unexpectedly large — either naturally or intentionally.

Without limits, a single request can consume disproportionate resources. What seemed like a minor edge case quickly turns into something that needs explicit control.

Another issue is validation.

If you accept any model name, any payload structure, and any input format, your system remains flexible but your metrics lose meaning. Cost calculations become inconsistent, comparisons stop being reliable, and “savings” become difficult to interpret.

Adding strict validation changes that. It makes the system more predictable, but also more opinionated.
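A sketch of what gateway-side validation might look like. The size cap, model allow-list, and rules here are invented for illustration, not the project's actual configuration:

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024               # hypothetical cap
ALLOWED_MODELS = {"gpt-4o", "gpt-4o-mini"}  # hypothetical allow-list

def validate_request(raw_body, payload):
    """Reject requests the gateway cannot meter reliably.
    Returns (ok, reason); the rules are illustrative only."""
    if len(raw_body) > MAX_PAYLOAD_BYTES:
        return False, "payload too large"
    if payload.get("model") not in ALLOWED_MODELS:
        return False, "unknown model"
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        return False, "messages must be a non-empty list"
    return True, "ok"

body = b'{"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}'
print(validate_request(body, json.loads(body)))  # (True, 'ok')
```

Each rejected category is one less source of noise in the cost metrics: unknown models cannot be priced, and unbounded payloads skew any per-request comparison.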

At that point, it is no longer just a transparent proxy. It becomes a controlled gateway. And that shift is intentional.

It’s worth noting that this layer is not a security firewall in the traditional sense. It does not attempt to detect malicious prompts, prevent prompt injection, or enforce content policies. Those concerns belong to the application layer, where context, user intent, and business logic are better understood. The goal here is different: to control cost, reduce unnecessary requests, and make LLM usage more predictable.


Architecture Matters More Than It Seems

From the outside, the architecture still looks simple. One layer in front of the LLM, minimal components, no intrusive changes to the application.

But over time, the reasoning behind each choice becomes more important.

  • Redis handles exact matches because it is fast and predictable.
  • Qdrant supports semantic search efficiently without adding unnecessary complexity.
  • Rust ensures that this layer can sit in the request path without introducing latency or instability.

Individually, these are straightforward decisions.

Together, they define whether the system can operate reliably under load. Because once this layer becomes part of the critical path, it is no longer optional. If it slows down, everything slows down. If it fails, everything fails.

At that point, optimization is no longer the only goal. Stability becomes equally important.


The Difference Between Demo Traffic and Reality

It is easy to produce impressive results in a controlled environment.

  • high cache hit rates
  • clear cost savings
  • clean, predictable behavior

But those results are shaped by the input.

Real systems behave differently. There are always new queries, unexpected variations, and edge cases that were not part of the initial design. You never reach 100% cache hits — and that’s not a failure.

A healthy system still generates misses. It still calls the model. It still adapts to new inputs.

The goal is not to eliminate LLM usage entirely. The goal is to eliminate unnecessary usage. That distinction becomes much more important once you move beyond a demo.


Where This Is Going

What started as a simple way to avoid duplicate requests is gradually evolving into something broader.

It is no longer just about caching. It becomes a control layer for how LLMs are used:

  • cost visibility
  • request validation
  • provider abstraction
  • traffic control

Caching is still at the center, but it is no longer the whole story. It is the entry point into a larger set of concerns that appear once LLM usage grows.


Final Thoughts

The original idea was simple: Don’t pay twice for the same answer.

That still holds. But applying that idea in a real system reveals a different challenge.

The difficult part is not avoiding the call. It is understanding when avoiding it actually makes sense. Because cost optimization, in practice, is not just about reducing usage. It is about understanding your system well enough to reduce costs in the right way.

You can explore the project here:

https://github.com/vcal-project/ai-firewall

How to Reduce OpenAI API Costs with Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on March 21, 2026.

A simple OpenAI-compatible gateway that eliminates duplicate requests and cuts token usage

While working on LLM-powered tools for one of my customers, I kept seeing something that didn’t feel right.

Users were asking the same or similar questions again and again. Support queries repeated. Internal assistants received nearly identical prompts. Even AI agents were looping through similar requests.

At first, it didn’t look like a problem. That’s just how users behave.

But then I looked at the cost.

Every repeated question meant another API call. Another batch of tokens. Another charge. Over time, it added up to more than I expected.

I realized something simple:

We are paying multiple times for the same answer.


Why Existing Solutions Didn’t Quite Work

Initially I looked at the available tools.

Redis helped with exact caching, but only when the prompt was identical. The moment a user rephrased the question slightly, the cache missed. “How do I get access to Jira?” and “Cannot get access to Jira” were treated as completely different requests.

I also explored RedisVL, which brings vector search capabilities into Redis. It moves in the right direction by combining caching and similarity in one place. But in practice, it still requires setting up embedding flows, defining schemas, tuning similarity thresholds, and integrating it manually into the LLM request pipeline.

Vector databases like Milvus, Weaviate, or Qdrant seemed promising as well. They can detect semantic similarity effectively, but integrating them into the request flow means building additional pipelines, managing embeddings, and writing glue code.

All of these tools are powerful, but they aren’t simple.

More importantly, none of them are designed as a drop-in layer in front of an LLM API. There was no unified solution that combined caching, semantic matching, and cost awareness in one place.


What I Built Instead

At some point, I decided to take a step back and ask a simple question:

What if we just put one smart layer in front of the LLM?

That’s how the AI Cost Firewall started.

Instead of modifying applications or adding complex pipelines, AI Firewall intercepts requests before they reach the model. If it’s already seen a similar request, it returns the cached response. If not, it forwards it, stores the result, and moves on.

From the application’s perspective, nothing changes. It still talks to an OpenAI-compatible API.

But behind the scenes, unnecessary calls disappear.


How It Works (Without the Complexity)

Screenshot: AI Cost Firewall Architecture

I intentionally kept the architecture minimal.

At the core, there’s a Rust-based API gateway that speaks the same language as the OpenAI API. For caching, I use Redis for exact matches and Qdrant for semantic similarity. Prometheus and Grafana provide visibility into what’s happening.

A request comes in, we check the cache, and only if needed do we call the LLM.

That’s it.

No SDK rewrites. No major architectural changes. Just one additional layer.
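The whole flow fits in a few lines. Here is a Python sketch with stand-ins for the cache and the upstream call; the real gateway is written in Rust, so this is only an illustration of the logic, not the implementation:

```python
def handle_request(prompt, cache, call_llm):
    """One pass through the gateway: cache first, model only on a miss.
    `cache` is any object with get/put; `call_llm` is the upstream call."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached          # no tokens spent, near-zero latency
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response

# Minimal stand-ins to show the flow.
class DictCache:
    def __init__(self):
        self.d = {}
    def get(self, k):
        return self.d.get(k)
    def put(self, k, v):
        self.d[k] = v

calls = []
def fake_llm(prompt):
    calls.append(prompt)       # record every upstream call
    return f"answer to: {prompt}"

cache = DictCache()
handle_request("hello", cache, fake_llm)  # miss: one upstream call
handle_request("hello", cache, fake_llm)  # hit: served from cache
print(len(calls))  # 1
```

The second request never reaches `fake_llm` — which is the entire point of the gateway.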


Why I Chose Rust

Since this component sits directly in the request path, performance matters.

I chose Rust because it provides low latency and predictable performance without garbage collection pauses. It handles concurrency well and keeps the memory footprint small, which makes it ideal for containerized deployments.

Most importantly, we can trust it not to become the bottleneck.

Why I Open-Sourced It

This layer sits between the application and the AI provider. That’s a sensitive place.

I felt it had to be transparent and auditable. Open source makes it easier to trust, easier to adopt, and easier to extend.

It also keeps the core idea simple: reducing costs shouldn’t introduce new risks or lock you into a vendor.


Getting Started in Minutes

I wanted the setup to be as simple as possible.

Clone the repository, start Docker, and point your application to a new endpoint.

git clone https://github.com/vcal-project/ai-firewall
cd ai-firewall
cp configs/ai-firewall.conf.example configs/ai-firewall.conf
nano configs/ai-firewall.conf # Replace the placeholders with your API keys
docker compose up -d

After that, you just replace your API base URL with:

http://localhost:8080/v1/chat/completions

That’s the entire integration.
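For illustration, here is what a request to the gateway could look like using only the Python standard library. The model name and API key are placeholders, and the actual send is commented out so the sketch runs without a gateway:

```python
import json
import urllib.request

# The only integration change: point the client at the local gateway
# instead of api.openai.com. The payload stays OpenAI-compatible.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "gpt-4o-mini",  # placeholder model name
    "messages": [{"role": "user", "content": "How do I get access to Jira?"}],
}
req = urllib.request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer YOUR_API_KEY"},
)
# urllib.request.urlopen(req) would send it; omitted here so the
# sketch stays runnable without the gateway running locally.
print(req.full_url)
```

Any OpenAI-compatible SDK works the same way: only the base URL changes.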


What Changed for Me

Once I started using this approach, two things became obvious.

First, a surprisingly large portion of requests was served directly from cache. The reason? The same questions had already been answered before.

Second, response times improved whenever the cache was hit.

I didn’t need to optimize prompts or switch models to see an effect. Just avoiding redundant calls made a noticeable difference.

To make this visible, I added a simple Grafana dashboard.

It shows how many requests are served from cache vs forwarded to the LLM, along with the estimated cost savings in real time.

Screenshot: Grafana dashboard showing cache hits and cost saving

The key metrics are:

  • cache hit ratio (how many requests never reach the LLM)
  • total tokens saved
  • estimated cost savings
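All three metrics can be derived from two raw counters plus pricing assumptions. The token count and price per 1K tokens below are hypothetical placeholders, not real measurements:

```python
def summarize(cache_hits, total_requests, tokens_per_request, price_per_1k_tokens):
    """Derive the three dashboard metrics from raw counters.
    Assumes a flat average token count per avoided request."""
    hit_ratio = cache_hits / total_requests if total_requests else 0.0
    tokens_saved = cache_hits * tokens_per_request
    cost_saved = tokens_saved / 1000 * price_per_1k_tokens
    return {"cache_hit_ratio": hit_ratio,
            "tokens_saved": tokens_saved,
            "estimated_savings_usd": cost_saved}

# Hypothetical traffic: 620 hits out of 1,000 requests,
# ~800 tokens per request, $0.002 per 1K tokens.
print(summarize(cache_hits=620, total_requests=1000,
                tokens_per_request=800, price_per_1k_tokens=0.002))
```

A real exporter would feed these from per-request counters into Prometheus rather than computing them in one shot, but the arithmetic is the same.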

What surprised me most was how quickly the savings accumulated even with relatively small traffic.


What Comes Next

I see this as a starting point rather than a finished product.

Next, I’m focusing on adding support for other LLM providers beyond OpenAI. Expanding analytics is another priority, along with exploring multi-model setups and smarter routing.

There’s still a lot to build — and that’s exactly the point.


Final Thoughts

AI costs don’t spike all at once. They grow quietly, request by request.

And in many cases, a large part of that cost is unnecessary.

We didn’t need a more complex system to reduce it. We just needed to stop sending the same request twice.

Sometimes the most effective optimization is the simplest one:

Not calling the model at all.


If you’re running LLM-powered tools and want to reduce costs without changing your application architecture, you can try it here:

https://github.com/vcal-project/ai-firewall

Why Edge AI Needs Lightweight Semantic Caches — and What Makes Them Hard to Build

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 27, 2025.


Today edge computing is reshaping the way AI systems are deployed. Instead of sending every request to centralized cloud infrastructure, more computation is happening on devices closer to end-users. These “edge environments” include IoT gateways, on-premise servers, mobile devices, micro-VMs, serverless functions, and browser-based applications. The appeal is clear: moving computation closer to where data is generated reduces latency, minimizes bandwidth requirements and allows organizations to satisfy strict data-privacy rules.

At the same time, WebAssembly (WASM) has emerged as a portable, sandboxed runtime for executing code in highly constrained or security-sensitive environments. Originally designed for browsers, WASM now runs in cloud edge workers, serverless platforms, and isolated environments where traditional binaries cannot be executed. These runtimes often restrict access to system calls such as networking, threading, or the local filesystem. They operate under strict memory limits, sometimes as low as tens of megabytes, and they prioritize deterministic, predictable execution.

So while it offers obvious advantages, running AI components at the edge introduces its own challenges, especially when applications rely on semantic search, embeddings, or large language models (LLMs).


A major issue arises when AI applications repeatedly generate similar responses to similar prompts. In a cloud setting this inefficiency is tolerable, but at the edge it becomes costly. Edge nodes often have hard limits on CPU time and memory allocation, meaning that even small local language models may struggle to meet real-time latency budgets. A semantic cache — a system that stores answers together with an embedding vector and returns a cached answer when the incoming request is semantically similar — is a natural solution. However, building such a system for constrained environments is significantly more difficult than building one for the cloud.

The first challenge is memory. Classical vector databases and similarity search engines rely on complex indexing structures such as HNSW graphs, which are fast but memory-intensive. Standard configurations easily grow to hundreds of megabytes and often assume the availability of multi-threading, background maintenance processes, and dynamic memory growth. Edge workers and WASM isolates cannot accommodate this. In many cases, the runtime enforces strict caps on linear memory and disallows growing beyond a fixed boundary. This immediately rules out most existing semantic search libraries, even before considering cold-start overhead or storage.

The second constraint is the execution environment itself. WASM runtimes typically do not expose POSIX-like APIs (Portable Operating System Interface, a family of standards developed by IEEE that define consistent application programming interfaces). Features such as mmap, file descriptors, or native sockets are unavailable unless the host explicitly provides them through WASI (WebAssembly System Interface), and even then, support varies. This makes it almost impossible to run vector databases “as-is,” because they depend heavily on operating system functionality and persistent background services. In edge environments developers have only a few milliseconds to initialize modules, produce a response, and return control to the runtime. A semantic cache that takes hundreds of milliseconds to load an index simply cannot be deployed in these contexts.

Cold-start behavior is another architectural concern. Unlike long-running cloud servers, edge workers may be rapidly created and destroyed. A new isolate might handle only one or two requests before being recycled. For AI applications, this means that any semantic cache must load extremely quickly — ideally in a few milliseconds — and must not rely on heavy initialization or dynamic graph reconstruction. Snapshotting becomes essential: developers need the ability to store the cache state in a compact format that loads deterministically and quickly into memory.
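A toy illustration of the snapshot idea: serialize embeddings and answers into one flat blob that loads back in a single linear pass, with no graph or index to rebuild. The binary format here is invented for the sketch:

```python
import struct
from array import array

DIM = 4  # embedding dimensionality (tiny, for the sketch)

def snapshot(entries):
    """Serialize (embedding, answer) pairs into one compact blob.
    Hypothetical layout: entry count, then per entry DIM float32
    values, answer length, answer bytes."""
    out = [struct.pack("<I", len(entries))]
    for emb, answer in entries:
        out.append(array("f", emb).tobytes())
        data = answer.encode()
        out.append(struct.pack("<I", len(data)) + data)
    return b"".join(out)

def load(blob):
    # Single linear pass: deterministic, allocation-light, fast to cold-start.
    (count,) = struct.unpack_from("<I", blob, 0)
    pos, entries = 4, []
    for _ in range(count):
        emb = list(array("f", blob[pos:pos + 4 * DIM]))
        pos += 4 * DIM
        (n,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        entries.append((emb, blob[pos:pos + n].decode()))
        pos += n
    return entries

cache = [([1.0, 0.0, 0.5, 0.25], "cached answer")]
blob = snapshot(cache)
print(load(blob) == cache)  # round-trips exactly for these float32 values
```

Because the blob is position-independent and self-describing, a freshly spawned isolate can map it in and answer its first query without any warm-up phase.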

There is also the question of energy and cost efficiency. Edge nodes operate on limited power budgets, especially in IoT scenarios. Recomputing the same embedding or calling an external LLM repeatedly wastes both energy and bandwidth. Reducing redundant inference calls requires a semantic memory layer that can match incoming queries to existing knowledge without exceeding stringent resource constraints.

Privacy regulations add an additional layer of complexity. One of the motivations for moving AI workloads to the edge is to keep sensitive data local. But to do that effectively, the system must avoid unnecessarily sending repeated questions or logs to a central model. A semantic cache therefore becomes not just a performance optimization but a privacy mechanism: if the system can answer from its local memory, no data transmission to external LLMs is required. Unfortunately, building such a cache in environments with restricted storage, no access to background processes, and strict runtime quotas is a non-trivial task.

These are the conditions under which traditional semantic search infrastructure begins to struggle. Large vector databases simply assume too much: too much RAM, too much access to the operating system, too much startup time, and too much persistence. Even lightweight semantic caches designed for server applications often rely on threading, shared memory, file-based checkpointing, or dynamically growing allocations. Most embedding-based caches were never designed with WASM runtimes, edge workers, or IoT gateways in mind.


This is precisely the gap that newer designs aim to address. Solutions like VCAL approach semantic caching not as a distributed system or standalone service but as a small in-process library that can run with minimal memory and without heavy OS dependencies. Instead of behaving like a database, it behaves more like a CPU-level cache for AI reasoning, storing question-answer pairs and their embeddings in an optimized structure that can fit within the constraints of edge and WASM environments. By avoiding reliance on network calls, background threads, or large indices, such systems become suitable for serverless workers, browser WASM modules, or embedded devices with limited RAM.
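To make the idea concrete, here is a deliberately tiny sketch of such a cache: bounded capacity, brute-force cosine search, no threads, no network, no filesystem — the properties a WASM isolate actually permits. It is illustrative only, not how VCAL is implemented:

```python
import math
from collections import OrderedDict

class EdgeSemanticCache:
    """Fixed-capacity, in-process semantic cache sketch.
    Oldest entries are evicted first, keeping memory bounded."""
    def __init__(self, capacity=128, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold
        self.store = OrderedDict()  # key -> (embedding, answer)

    def put(self, key, embedding, answer):
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict the oldest entry
        self.store[key] = (embedding, answer)

    def lookup(self, embedding):
        # Brute-force scan: no HNSW graph, no background maintenance,
        # cost strictly bounded by capacity.
        best, best_sim = None, 0.0
        for emb, answer in self.store.values():
            dot = sum(a * b for a, b in zip(embedding, emb))
            norm = (math.sqrt(sum(a * a for a in embedding))
                    * math.sqrt(sum(b * b for b in emb)))
            sim = dot / norm if norm else 0.0
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

cache = EdgeSemanticCache(capacity=2)
cache.put("q1", [1.0, 0.0], "answer one")
cache.put("q2", [0.0, 1.0], "answer two")
cache.put("q3", [0.7, 0.7], "answer three")  # capacity reached: q1 evicted
print(cache.lookup([0.0, 0.95]))  # close to q2: cache hit
```

Brute-force search over a few hundred entries is often fast enough at the edge, and it trades index memory for predictable, deterministic lookup cost — exactly the trade the constrained runtime demands.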

In this sense, the semantic cache becomes a missing piece of infrastructure for edge AI. As more organizations push inference closer to the user, the need for a lightweight, deterministic, low-memory semantic lookup system grows. The limitations of edge platforms — from strict memory caps to rapid cold starts — make this a difficult problem, and the lack of suitable solutions has slowed the adoption of AI features outside centralized cloud environments. As WASM matures and edge utilities evolve, semantic caching may become a standard part of the AI pipeline, enabling faster, cheaper, and more privacy-preserving deployments across a wide range of devices.