Blog | VCAL Docs

From Words to Vectors: How Semantics Traveled from Linguistics to Large Language Models

January 17, 2026 · 8 min read

Founder of VCAL Project

Originally published on Dev.to on January 17, 2026.
Read the Dev.to version

Why meaning moved from definitions to structure — and what that changed for modern AI

When engineers talk about semantic search, embeddings, or LLMs that "understand" language, it often sounds like something fundamentally new. Yet the problems modern AI systems face — meaning, reference, ambiguity, and context — were already central questions in linguistics and philosophy more than a century ago.

This article traces how the concept of semantics evolved across disciplines: from linguistics and philosophy, through symbolic AI and statistical NLP, and finally into the neural architectures that power modern large language models, and why this history matters for how we design retrieval, memory, and language systems today. The journey reveals that today's AI systems are not a break from the past, but the convergence of long-standing ideas finally made computationally feasible.

Linguistic Origins: Meaning as a System, Not a Label

Modern semantics begins not with computers, but with language itself. In the late 19th and early 20th centuries, linguists began to reject the naive idea that words simply "point" to things in the world. One of the most influential figures in this shift was Ferdinand de Saussure, who argued that language is a structured system of signs rather than a naming scheme.

Saussure proposed that each linguistic sign consists of two inseparable parts: the signifier (the sound or written form) and the signified (the concept evoked). Crucially, the relationship between the two is arbitrary. There is nothing inherently "dog-like" about the word dog. Its meaning arises because it occupies a position within a broader system of contrasts: dog is meaningful because it is not cat, not wolf, not table.

This was a radical idea at the time. Meaning, Saussure claimed, is relational. Words derive significance from how they differ from other words, not from direct correspondence with reality. This insight quietly laid the conceptual groundwork for everything from structural linguistics to modern vector-based representations.

Philosophy of Language: Meaning, Logic, and Composition

While linguists focused on structure, philosophers sought precision. In particular, Gottlob Frege transformed semantics by embedding it into formal logic. Frege introduced a critical distinction between sense — the mode of presentation of an idea, and reference — the actual object being referred to.

This distinction explained how two expressions could refer to the same thing while conveying different information. "The morning star" and "the evening star" both refer to Venus, yet they are not interchangeable in all contexts. Meaning, therefore, could not be reduced to reference alone.

More importantly, Frege formalized the idea of compositionality: the meaning of a sentence is determined by the meanings of its parts and the rules used to combine them. This principle became foundational not only in philosophy, but later in programming languages, logic systems, and early AI models.

In retrospect, compositionality is what allowed meaning to be treated as something computable, at least in theory.

Early Artificial Intelligence: When Meaning Was Symbolic

When I studied linguistics at university many years ago, everything up to this point was already part of the curriculum. Structural linguistics, philosophy of language, and formal semantics provided a solid theoretical foundation. What none of us could have anticipated at the time was how directly these ideas would later intersect with computer science in what would come to be called artificial intelligence.

When AI emerged as a field in the mid-20th century, it inherited philosophy's confidence in symbols and logic. Early systems assumed that meaning could be explicitly represented through formal structures: symbols, predicates, rules, and ontologies. To "understand" language was to transform symbols according to carefully designed rules.

For a while, this worked. Expert systems, knowledge graphs, and first-order logic engines achieved impressive results in narrowly defined domains such as medical diagnosis, chemical analysis, and configuration problems. Within carefully bounded worlds, symbolic semantics appeared tractable.

Natural language, however, quickly exposed the limits of this approach. Human language is ambiguous, context-dependent, and constantly evolving. Encoding all possible meanings and interpretations proved not merely difficult, but fundamentally unscalable. Symbolic systems were brittle: they failed not gradually, but catastrophically, when faced with inputs that deviated even slightly from their assumptions.

Semantics, it turned out, was far messier than logic had allowed, and far more resistant to being fully written down.

The Statistical Shift: Meaning Emerges from Usage

A quiet revolution began when linguists and computer scientists started to look not at rules, but at usage patterns. The idea that meaning could be inferred from how words are used rather than how they are defined gained traction in the mid-20th century.

The core insight was simple but profound: words that appear in similar contexts tend to have similar meanings. Instead of encoding semantics explicitly, one could measure it statistically by analyzing large corpora of text.

This approach, known as distributional semantics, reframed meaning as something empirical rather than prescriptive. Words became vectors of co-occurrence statistics. Similarity was no longer binary or rule-based, but graded and approximate.

This was a decisive break from symbolic AI and a return, in spirit, to Saussure's relational view of meaning.

Word Embeddings: Geometry Becomes Semantics

Distributional ideas matured dramatically with the introduction of neural word embeddings, particularly models like Word2Vec. Instead of relying on sparse frequency counts, these models learned dense, low-dimensional vector representations optimized to predict linguistic context.

What emerged surprised even their creators. Semantic relationships appeared as geometric regularities in vector space. Differences between vectors encoded analogies, hierarchies, and semantic proximity. Meaning became something you could measure with cosine similarity.

This was not symbolic understanding, but it was not random either. It was structure: learned rather than designed.

For the first time, machines exhibited behavior that looked like semantic intuition, despite having no explicit definitions or rules.

Contextual Semantics: Meaning Is Not Fixed

Static embeddings had a fundamental limitation: each word had exactly one vector, regardless of context. But human language does not work that way. The meaning of a word shifts depending on surrounding words, speaker intent, situation, and even emotion.

Transformer-based models, particularly BERT, addressed this by making representations contextual. Instead of asking "What does this word mean?", the model learned to ask "What does this word mean here?"

Through attention mechanisms, transformers model relationships between tokens dynamically. Meaning is no longer stored in a single vector per word, but distributed across layers and activations that respond to context.

This marked a crucial step toward pragmatic semantics: language as it is actually used, not as it is abstractly defined.

Large Language Models: Semantics as Emergent Structure

Large language models such as GPT do not contain explicit semantic representations in the traditional sense. They are trained to predict the next token in a sequence. And yet, at scale, they display behaviors that look strikingly semantic: summarization, reasoning, translation, abstraction.

The key idea is emergence. As models compress vast amounts of linguistic data, they internalize regularities about the world, language, and human communication. Semantics arises not as a module, but as a side effect of learning efficient representations.

These models do not "know" meaning in a philosophical sense. But they operate in a space where syntax, semantics, and pragmatics are inseparable, and where relational structure dominates.

When Meaning Becomes Operational

For practitioners building semantic search systems, RAG pipelines, or LLM-adjacent infrastructure, this history is not academic background — it is an explanation of why certain designs consistently work while others fail. Exact matching breaks down because natural language rarely repeats itself verbatim. Embeddings succeed not because they are clever, but because they mirror how meaning behaves in practice: approximately, relationally, and with tolerance for variation.

Once this is understood, several architectural consequences follow naturally. Retrieval quality depends less on perfect recall and more on selecting representations that preserve semantic neighborhoods. Caching strategies become viable only when equivalence is defined by similarity rather than identity. Evaluation metrics must account for graded relevance instead of binary correctness. Even system boundaries shift: components no longer exchange "facts", but approximations of meaning that remain useful within context.

Semantic systems are effective precisely because they do not attempt to eliminate ambiguity. They absorb it. Whether you are designing a vector store, placing a semantic cache in front of an LLM, or building a long-term memory layer for conversational systems, you are implicitly making choices about how much approximation your system tolerates and where that tolerance is enforced.

Closing Thought: Semantics as Shared Infrastructure

What began as a linguistic insight, that words gain meaning through their relations to other words, has quietly become an organizing principle for entire computational systems. Meaning no longer lives in dictionaries, rules, or symbols, but in patterns: in how expressions cluster, diverge, and reappear across vast landscapes of language. Semantics is no longer something a system contains; it is something a system moves through.

This shift took more than a century to unfold. It required philosophers to separate sense from reference, linguists to abandon naming theories, and engineers to accept approximation over certainty. Only when data became abundant and computation relatively cheap did this long trajectory converge into something operational. Semantics, once debated in lecture halls and footnotes, has become infrastructure — implicit, distributed, and shared.

That idea, radical when first proposed, has been waiting over a hundred years for enough data and compute to become practical.

And now, finally, it has.

Why Edge AI Needs Lightweight Semantic Caches — and What Makes Them Hard to Build

November 27, 2025 · 6 min read

Sergey Lunev

Founder of VCAL Project

Originally published on Medium.com on November 27, 2025.
Read the Medium.com version

Cover

Today edge computing is reshaping the way AI systems are deployed. Instead of sending every request to centralized cloud infrastructure, more computation is happening on devices closer to end-users. These “edge environments” include IoT gateways, on-premise servers, mobile devices, micro-VMs, serverless functions, and browser-based applications. The appeal is clear: moving computation closer to where data is generated reduces latency, minimizes bandwidth requirements and allows organizations to satisfy strict data-privacy rules.

At the same time, WebAssembly (WASM) has emerged as a portable, sandboxed runtime for executing code in highly constrained or security-sensitive environments. Originally designed for browsers, WASM now runs in cloud edge workers, serverless platforms, and isolated environments where traditional binaries cannot be executed. These runtimes often restrict access to system calls such as networking, threading, or the local filesystem. They operate under strict memory limits, sometimes as low as tens of megabytes, and they prioritize deterministic, predictable execution.

Altogether, while offering obvious advantages, running AI components at the edge introduces its own challenges, especially when applications rely on semantic search, embeddings, or large language models (LLM).

A major issue arises when AI applications repeatedly generate similar responses to similar prompts. In a cloud setting this inefficiency is tolerable, but at the edge it becomes costly. Edge nodes often have hard limits on CPU time and memory allocation, meaning that even small local language models may struggle to meet real-time latency budgets. A semantic cache — a system that stores answers together with an embedding vector and returns a cached answer when the incoming request is semantically similar — is a natural solution. However, building such a system for constrained environments is significantly more difficult than building one for the cloud.

The first challenge is memory. Classical vector databases and similarity search engines rely on complex indexing structures such as HNSW graphs, which are fast but memory-intensive. Standard configurations easily grow to hundreds of megabytes and often assume the availability of multi-threading, background maintenance processes, and dynamic memory growth. Edge workers and WASM isolates cannot accommodate this. In many cases, the runtime enforces strict caps on linear memory and disallows growing beyond a fixed boundary. This immediately rules out most existing semantic search libraries, even before considering cold-start overhead or storage.

The second constraint is the execution environment itself. WASM runtimes typically do not expose POSIX-like APIs (Portable Operating System Interface, a family of standards developed by IEEE that define consistent application programming interfaces). Features such as mmap, file descriptors, or native sockets are unavailable unless the host explicitly provides them through WASI (WebAssembly System Interface), and even then, support varies. This makes it almost impossible to run vector databases “as-is,” because they depend heavily on operating system functionality and persistent background services. In edge environments developers have only a few milliseconds to initialize modules, produce a response, and return control to the runtime. A semantic cache that takes hundreds of milliseconds to load an index simply cannot be deployed in these contexts.

Cold-start behavior is another architectural concern. Unlike long-running cloud servers, edge workers may be rapidly created and destroyed. A new isolate might handle only one or two requests before being recycled. For AI applications, this means that any semantic cache must load extremely quickly — ideally in a few milliseconds — and must not rely on heavy initialization or dynamic graph reconstruction. Snapshotting becomes essential: developers need the ability to store the cache state in a compact format that loads deterministically and quickly into memory.

Then, there is also the question of energy and cost efficiency. Edge nodes operate on limited power budgets, especially in IoT scenarios. Recomputing the same embedding or calling an external LLM repeatedly wastes both energy and bandwidth. Reducing redundant inference calls requires a semantic memory layer that can match incoming queries to existing knowledge without exceeding stringent resource constraints.

Privacy regulations add an additional layer of complexity. One of the motivations for moving AI workloads to the edge is to keep sensitive data local. But to do that effectively, the system must avoid unnecessarily sending repeated questions or logs to a central model. A semantic cache therefore becomes not just a performance optimization but a privacy mechanism: if the system can answer from its local memory, no data transmission to external LLMs is required. Unfortunately, building such a cache in environments with restricted storage, no access to background processes, and strict runtime quotas is a non-trivial task.

These are the conditions under which traditional semantic search infrastructure begins to struggle. Large vector databases simply assume too much: too much RAM, too much access to the operating system, too much startup time, and too much persistence. Even lightweight semantic caches designed for server applications often rely on threading, shared memory, file-based checkpointing, or dynamically growing allocations. Most embedding-based caches were never designed with WASM runtimes, edge workers, or IoT gateways in mind.

This is precisely the gap that newer designs aim to address. Solutions like VCAL approach semantic caching not as a distributed system or standalone service but as a small in-process library that can run with minimal memory and without heavy OS dependencies. Instead of behaving like a database, it behaves more like a CPU-level cache for AI reasoning, storing question-answer pairs and their embeddings in an optimized structure that can fit within the constraints of edge and WASM environments. By avoiding reliance on network calls, background threads, or large indices, such systems become suitable for serverless workers, browser WASM modules, or embedded devices with limited RAM.

In this sense, the semantic cache becomes a missing piece of infrastructure for edge AI. As more organizations push inference closer to the user, the need for a lightweight, deterministic, low-memory semantic lookup system grows. The limitations of edge platforms — from strict memory caps to rapid cold starts — make this a difficult problem, and the lack of suitable solutions has slowed the adoption of AI features outside centralized cloud environments. As WASM matures and edge utilities evolve, semantic caching may become a standard part of the AI pipeline, enabling faster, cheaper, and more privacy-preserving deployments across a wide range of devices.

Beyond Vector Databases: The Case for Local Semantic Caching

November 6, 2025 · 6 min read

Sergey Lunev

Founder of VCAL Project

Originally published on Medium.com on November 6, 2025.
Read the Medium.com version

Cover

When “intelligence” wastes cycles

Most teams building LLM-powered products eventually realize that a large portion of their API costs come not from new insights, but from repeated questions.

A support bot, an internal assistant, or an analytics copilot, all encounter thousands of near-identical queries:

“How do I pass the API key to the local model gateway?”
“Why is the dev database connection timing out?”
“How can I refresh the cache without restarting the service?”

Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.

What do we have as a result? Burned tokens, wasted latency, and duplicated reasoning.

Vector databases solved storage, not reuse

The industry's first instinct was to throw vector databases at the problem. They excel at persistent embeddings and semantic retrieval, but they were never built for reuse. What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state. In other words, they store knowledge, not memory.

Traditional vector databases follow a key:value paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore. A semantic cache, by contrast, treats embeddings as dynamic memory — governed by similarity, expiration, and adaptive retention. Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.

With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings. This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.

In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.

From data stores to memory layers

In my previous Dev.to article, I explained how we built VCAL, a Rust-based semantic cache that sits between your app and the LLM. Instead of persisting every vector, it memorizes embeddings for a short time, indexed by semantic similarity and metadata.

When a new query arrives, VCAL compares it to cached vectors. If it is close enough — a cache hit — the LLM call is skipped, and the stored answer is returned in milliseconds. Otherwise, the request proceeds normally, and the response is stored for future matches.

The design combines concepts from vector search and traditional caching systems, enhanced with features for resilience and monitoring:

HNSW index for ultra-fast approximate similarity search.
TTL and LRU eviction for automatic cache turnover.
Snapshotting for persistence between restarts.
Prometheus metrics for observability.

All of it runs on-prem, next to your model or gateway with no remote dependencies.

Why local caching changes the economics

Unlike vector databases, a local semantic cache has one simple purpose: avoid redundant reasoning. Each avoided LLM call translates directly into saved tokens, lower API bills, and shorter response times.

In real deployments we’ve seen:

30–60 % reduction in LLM calls
Millisecond-level latency on repeated queries enabling near-real-time responsiveness
Predictable resource usage: no external round-trips, no cloud egress costs, and no multi-tenant contention

At scale, the more your users interact, the greater the savings become. Instead of paying per token for every repetition, you amortize prior reasoning across sessions and teams.

And because VCAL runs inside your private environment, all caching and embeddings stay under your control ensuring data privacy, compliance, and deterministic performance even in regulated industries.

A new layer in the AI stack

If you visualize the modern LLM stack, the simplified design looks like this:

User → Application → LLM Gateway → Model

or, if a RAG (Retrieval-Augmented Generation) framework is involved:

User → Application → Retriever → Vector DB → (context) → LLM Gateway → Model

Adding a semantic cache such as VCAL introduces this new dimension:

User → Application → Retriever → Vector DB → (context) → Semantic Cache → LLM Gateway → Model

Here, the cache checks whether a semantically equivalent query was already answered. If found, the response is returned instantly — skipping tokenization, embedding, and inference altogether. If not, the request continues as usual, and the new answer is stored for future reuse.

Vector databases still matter but they belong to the knowledge layer, not the inference path. What has been missing so far is a memory layer that prevents repeated reasoning altogether. Semantic caching fills the missing “memory” slot in between. It is not a replacement for RAG, it is a complement. While RAG injects context, caching avoids duplication.

Engineering for low latency

Achieving millisecond response times in semantic caching requires more than just a fast similarity search algorithm. It’s the result of careful coordination between data structures, memory layout, and concurrency control.

The cache can be implemented efficiently in systems programming languages such as Rust, using an HNSW-based index for approximate nearest-neighbor search. HNSW provides logarithmic-scale query complexity while maintaining accuracy for large collections of embeddings, making it suitable for workloads that reach millions of cached entries.

Low latency also depends on predictable memory management and lock-free or fine-grained synchronization between threads. Instead of allocating and freeing vectors dynamically, embeddings are often stored in preallocated arenas or memory-mapped regions to minimize fragmentation and system calls. Parallel workers can update the index or evaluate similarity thresholds concurrently, so that retrieval scales with the number of available cores.

In practice, a semantic cache can be deployed as a lightweight service beside an inference gateway, communicating over local HTTP or gRPC. It can also be embedded directly into an application process when minimal overhead is required, for example, within an agent runtime or API handler.

The bigger picture

Caching has always been an invisible driver of performance — from CPU registers that reuse instructions, to CDNs that reuse content, to databases that reuse queries.

Each generation of systems extends the notion of what can be reused. As language models enter production, we are witnessing a shift toward semantic reuse: reusing meaning rather than data. This enables systems to recall previous reasoning instead of repeating it — a step toward more efficient and sustainable AI infrastructure.

In this new layer of the AI stack, semantic caching becomes a form of reasoning memory: it stores the results of understanding, not just storage operations. Instead of recomputing the same insight across thousands of near-identical prompts, we can recall it instantly — with full control over latency, privacy, and cost.

How I Created a Semantic Cache Library for AI

October 27, 2025 · 4 min read

Sergey Lunev

Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.
Read the Dev.to version

Cover

Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.

The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it to what was submitted earlier, and if it’s similar enough, return the stored answer instead of generating an LLM response, all this before asking the model.

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

Store and search vector embeddings in RAM using HNSW graph indexing
Handle TTL and LRU evictions automatically
Save snapshots to disk so it could restart fast

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services while deploying it on-prem or in a cloud.

Screenshot: Grafana dashboard showing cache hits and cost saving

What It Feels Like to Use

Unlike a full vector database, VCAL isn’t designed for long-term storage or analytics. I didn’t want to build another vector database.
VCAL is intentionally lightweight. It is a fast, in-memory semantic cache optimized for repeated LLM queries.

Integrating VCAL takes minutes.
Instead of calling the model directly, you send your query to VCAL first.
If a similar question has been asked before — and the similarity threshold can be tuned — VCAL returns the answer from its cache in milliseconds. If it’s a new question, VCAL asks the LLM, stores the result, and returns it.
Next time, if a semantically similar question comes in, VCAL answers instantly.

It’s like adding a memory layer between your app and the model — lightweight, explainable, and under your full control.

Flow diagram: user → VCAL → LLM

Lessons Learned

LLMs love redundancy. Once you start caching semantically, you realize how often people repeat the same question with different words.
Caching semantics ≠ caching text. Cosine similarity and vector distances matter more than exact matches.
Performance scales beautifully. A well-tuned cache can handle thousands of lookups per second, even on modest hardware.
It scales big. A single VCAL Server instance can comfortably store and serve up to 10 million cached answers in memory, depending on embedding dimensions and hardware.

What’s Next

We’re now working on a licensing server, enterprise snapshot formats, and RAG-style extensions, so teams can use VCAL not just for Q&A caching, but as the foundation for private semantic memory.

If you’re building AI agents, support desks, or knowledge assistants, you’ll likely benefit from giving your system a brain that remembers.

You can explore more at vcal-project.com - try the free 30-day Trial Evaluation of VCAL Server or jump into the open-source vcal-core version on GitHub.

Thanks for reading!
If this resonates with you, please drop a comment. I’d love to hear how you’re approaching caching and optimization for AI apps.

Why meaning moved from definitions to structure — and what that changed for modern AI​

Linguistic Origins: Meaning as a System, Not a Label​

Philosophy of Language: Meaning, Logic, and Composition​

Early Artificial Intelligence: When Meaning Was Symbolic​

The Statistical Shift: Meaning Emerges from Usage​

Word Embeddings: Geometry Becomes Semantics​

Contextual Semantics: Meaning Is Not Fixed​

Large Language Models: Semantics as Emergent Structure​

When Meaning Becomes Operational​

Closing Thought: Semantics as Shared Infrastructure​

When “intelligence” wastes cycles​

Vector databases solved storage, not reuse​

From data stores to memory layers​

Why local caching changes the economics​

A new layer in the AI stack​

Engineering for low latency​

The bigger picture​

Further reading​

The “Eureka!” Moment​

What It Feels Like to Use​

Lessons Learned​

What’s Next​

Why meaning moved from definitions to structure — and what that changed for modern AI

Linguistic Origins: Meaning as a System, Not a Label

Philosophy of Language: Meaning, Logic, and Composition

Early Artificial Intelligence: When Meaning Was Symbolic

The Statistical Shift: Meaning Emerges from Usage

Word Embeddings: Geometry Becomes Semantics

Contextual Semantics: Meaning Is Not Fixed

Large Language Models: Semantics as Emergent Structure

When Meaning Becomes Operational

Closing Thought: Semantics as Shared Infrastructure

When “intelligence” wastes cycles

Vector databases solved storage, not reuse

From data stores to memory layers

Why local caching changes the economics

A new layer in the AI stack

Engineering for low latency

The bigger picture

Further reading

The “Eureka!” Moment

What It Feels Like to Use

Lessons Learned

What’s Next