The EOS Token Isn't Actually the Best Way to Embed Code

💻 Adaptive Cross-Attention Pooling Implementation

Replace EOS token pooling with this superior method for code embeddings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveCrossAttentionPooling(nn.Module):
    """
    Replaces standard EOS token pooling with adaptive cross-attention.
    Dynamically weights all tokens in the sequence for better embeddings.
    """
    
    def __init__(self, hidden_size, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        
        # Learnable query vector (replaces static EOS token)
        self.query = nn.Parameter(torch.randn(1, 1, hidden_size))
        
        # Multi-head attention components
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)
        
    def forward(self, hidden_states, attention_mask=None):
        """
        Args:
            hidden_states: [batch_size, seq_len, hidden_size]
            attention_mask: [batch_size, seq_len]
        Returns:
            pooled_output: [batch_size, hidden_size]
        """
        batch_size = hidden_states.size(0)
        
        # Project to query, key, value spaces
        q = self.q_proj(self.query).repeat(batch_size, 1, 1)  # [batch, 1, hidden]
        k = self.k_proj(hidden_states)  # [batch, seq_len, hidden]
        v = self.v_proj(hidden_states)  # [batch, seq_len, hidden]
        
        # Reshape for multi-head attention
        q = q.view(batch_size, 1, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Compute attention scores
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        
        # Mask out padding positions; use the dtype's minimum value so the
        # mask also behaves correctly under fp16/bf16 inference
        if attention_mask is not None:
            attn_mask = attention_mask.unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, seq_len]
            attn_scores = attn_scores.masked_fill(
                attn_mask == 0, torch.finfo(attn_scores.dtype).min
            )
        
        # Softmax and weighted sum
        attn_weights = F.softmax(attn_scores, dim=-1)
        context = torch.matmul(attn_weights, v)
        
        # Reshape and project back
        context = context.transpose(1, 2).contiguous().view(batch_size, 1, -1)
        pooled_output = self.out_proj(context).squeeze(1)
        
        return pooled_output

# Usage example (with a hypothetical backbone that returns per-token hidden states):
# model = YourLLMModel()
# hidden_states = model(input_ids)  # [batch, seq_len, hidden_size]
# pooler = AdaptiveCrossAttentionPooling(hidden_size=768)
# embeddings = pooler(hidden_states, attention_mask)  # [batch, hidden_size]

The Bottleneck We Didn't Know We Had

For years, the standard playbook for creating dense vector embeddings from large language models has relied on a simple trick: take the hidden state of the final token—often the End-of-Sequence (EOS) token—and use that as the singular representation for the entire input sequence. This method powers everything from semantic search in code repositories to AI-powered autocomplete. It's efficient, straightforward, and deeply ingrained in the infrastructure of modern developer tools. But what if this foundational technique is fundamentally limiting our models' understanding of complex code?
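
For concreteness, the standard approach looks roughly like the minimal sketch below, which assumes right-padded inputs and a hidden_states tensor from any causal LM; the helper name is illustrative, not an API from the report.

import torch

def last_token_pooling(hidden_states, attention_mask):
    """
    Standard EOS/last-token pooling: use the hidden state of the final
    non-padding token as the embedding for the whole sequence.
        hidden_states:  [batch_size, seq_len, hidden_size]
        attention_mask: [batch_size, seq_len]  (1 = real token, 0 = padding)
    """
    last_idx = attention_mask.sum(dim=1) - 1  # index of last real token per sequence
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]  # [batch_size, hidden_size]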

This is the provocative question at the heart of the C2LLM (Contrastive Code Large Language Models) technical report. The research presents a family of code embedding models in 0.5B and 7B parameter sizes that challenge the EOS-centric status quo. By building on the capable Qwen-2.5-Coder backbones and introducing a novel Pooling by Multihead Attention (PMA) module, C2LLM demonstrates that there's a better way to distill the meaning of a code snippet into a single vector. The results aren't marginal; they suggest we've been operating with an unnecessary information bottleneck.

Why the EOS Token Fails Complex Code

The logic behind using the EOS token's representation is seductive in its simplicity. The model processes the entire sequence token-by-token, with each step building a contextual understanding. By the time it reaches the final token, the theory goes, the model's hidden state should encapsulate the cumulative meaning of everything that came before. For natural language paragraphs, this often works reasonably well.

Code, however, is a different beast. Its semantics are not strictly linear. The importance of a function definition isn't fully realized until its later invocations. A variable declared early might be critical for logic that appears hundreds of lines later. A class's behavior is defined by an interplay of methods scattered throughout its body. Relying on the final token's context—which is heavily weighted toward the very end of the sequence—can mean underrepresenting crucial information from the beginning or middle of a file. The EOS token becomes an information chokepoint, forced to carry the weight of the entire program's meaning through a single, final vector.

C2LLM's approach is to break this bottleneck. Instead of anointing one token's hidden state as the king, the PMA module performs an adaptive pooling operation across all token embeddings. It's a form of cross-attention where a small set of learnable query vectors attends to the entire sequence of token representations produced by the LLM backbone. This allows the model to dynamically decide which parts of the code are most salient for creating a holistic embedding, aggregating information from tokens regardless of their position.

How Adaptive Cross-Attention Pooling Works

The technical elegance of the PMA module lies in its marriage of pre-existing knowledge and new learning. The Qwen-2.5-Coder backbone is a powerful, pretrained model that already understands code syntax and semantics. Its causal attention mechanism gives each token a rich representation informed by all preceding tokens. The PMA module doesn't discard this; it builds upon it.
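
As a rough illustration (not the report's training pipeline), per-token hidden states from a pretrained backbone can be fed directly into a pooling module like the one above. The Hugging Face checkpoint name below is an assumption; substitute whichever Qwen-2.5-Coder size you actually use.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id; any causal code LM that exposes per-token states works
checkpoint = "Qwen/Qwen2.5-Coder-0.5B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
backbone = AutoModel.from_pretrained(checkpoint)

batch = tokenizer(["def add(a, b):\n    return a + b"], return_tensors="pt")
with torch.no_grad():
    hidden_states = backbone(**batch).last_hidden_state  # [batch, seq_len, hidden]

pooler = AdaptiveCrossAttentionPooling(hidden_size=backbone.config.hidden_size)
embedding = pooler(hidden_states, batch["attention_mask"])  # [batch, hidden]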

Think of it this way: the LLM backbone is an expert code reader, producing nuanced impressions for every line. The PMA module is a savvy editor who reads all those impressions and writes a concise, comprehensive summary. It uses multihead attention—a mechanism these models already excel at—to let a handful of summary "query" vectors look over the entire sequence of token embeddings. Each query can focus on different aspects: one might attend to function signatures and imports, another to control flow logic, another to data structure manipulations. The outputs of these queries are then combined to form the final, fixed-dimensional sequence embedding.
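
The single-query module at the top of this post captures the core idea; a PMA-style variant with several learnable queries, sketched below on top of PyTorch's stock nn.MultiheadAttention, is closer to what the paragraph above describes. The query count and the final fusion layer are assumptions, not details from the report.

import torch
import torch.nn as nn

class MultiQueryAttentionPooling(nn.Module):
    """
    PMA-style pooling sketch: a small set of learnable query vectors attends
    over all token representations, and their outputs are fused into one
    fixed-dimensional sequence embedding.
    """
    def __init__(self, hidden_size, num_queries=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.out_proj = nn.Linear(num_queries * hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        # hidden_states: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]
        queries = self.queries.expand(hidden_states.size(0), -1, -1)
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        summaries, _ = self.attn(queries, hidden_states, hidden_states,
                                 key_padding_mask=key_padding_mask)
        return self.out_proj(summaries.flatten(1))  # [batch, hidden_size]

Each query row is free to settle on a different aspect of the code during training; concatenating the per-query summaries and projecting them down is just one simple way to combine them.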

This method delivers a dual advantage. First, it fully utilizes the rich, causal representations the LLM painstakingly learned during pretraining. Second, it liberates the model from the positional tyranny of the EOS token. Information from a critical `import` statement at the top of a file or a key function definition can be weighted just as heavily as the final `return` statement.

The Tangible Impact: Beyond Academic Benchmarks

The report validates C2LLM on standard code retrieval tasks, where the model must find semantically similar code snippets from a large database. The results show consistent and significant improvements over strong baselines that use mean pooling or EOS token pooling. But the real-world implications extend far beyond leaderboard scores.

Consider a developer searching their massive monorepo for "code that parses JSON and then validates the schema." An EOS-based embedding might over-index on the validation logic if it appears last. A C2LLM embedding, synthesizing information from both the parsing and validation sections, could retrieve a more semantically precise match. In pair programming assistants, better embeddings mean better context understanding, leading to more relevant suggestions. For code duplication detection or vulnerability scanning, a model that grasps the full semantic weight of a code block, not just its conclusion, will be more accurate.
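
To make that retrieval scenario concrete, here is a minimal similarity-search sketch; embed() is a purely hypothetical stand-in for whichever model produces the embeddings.

import torch.nn.functional as F

def retrieve(query_emb, corpus_embs, top_k=5):
    """
    Rank stored code-snippet embeddings by cosine similarity to a query embedding.
        query_emb:   [hidden_size]
        corpus_embs: [num_snippets, hidden_size]
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs)  # [num_snippets]
    return sims.topk(min(top_k, corpus_embs.size(0)))

# Hypothetical usage:
# query_emb = embed("code that parses JSON and then validates the schema")
# scores, indices = retrieve(query_emb, snippet_embeddings)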

The release of both 0.5B and 7B models is a strategic masterstroke. The smaller model offers a path for integration into latency-sensitive environments like IDEs or lightweight CI/CD pipelines. The larger 7B model provides state-of-the-art accuracy for offline analysis, research, and complex enterprise search systems. This bifurcation acknowledges that the "best" model depends entirely on the trade-off between performance and practical constraints.

The New Frontier: Rethinking Embedding Generation

C2LLM's success with PMA should act as a catalyst, prompting a reevaluation of pooling strategies across the board. The field has long treated the pooling step as a simple, almost trivial, post-processing operation after the "real" work of the LLM is done. This research proves it is a critical component worthy of architectural innovation.

The immediate implication is that any team building code intelligence features—from GitHub and GitLab to startups in the AI-for-dev space—should be examining this approach. The performance lift is compelling. The deeper implication is conceptual: our models' understanding of code is only as good as the embeddings we extract from them. If we use a crude method to create those embeddings, we're wasting the sophisticated understanding the model has already developed.

What's next? The principles of C2LLM are not limited to code. Any domain with long-range, non-linear dependencies—legal documents, scientific papers, technical manuals—could benefit from moving beyond last-token pooling. The research opens the door to a new wave of embedding models that are not just larger, but smarter in how they condense information.

The Bottom Line for Developers

The myth was that the final token knew best. The reality, as demonstrated by C2LLM, is that wisdom is distributed across the entire sequence. The next generation of developer tools won't just be powered by bigger LLMs; they'll be powered by smarter methods of harnessing what those LLMs already know. The era of treating embedding generation as an afterthought is over. For developers, this translates to more accurate search, more context-aware assistance, and ultimately, tools that genuinely understand the structure and intent of their code, not just the last line they typed.

The takeaway is clear: pay attention to how your AI tools create embeddings. The difference between a good and a great code AI might not be the model size, but the pooling technique. As this research spreads, expect the humble EOS token to be dethroned as the default source of truth, making way for more nuanced, adaptive, and powerful representations of what we build.
