Technology

Unlocking LLMs Beyond Text: Custom Heads for Smarter AI Tasks

Unlocking LLMs Beyond Text Generation: Custom Heads for Smarter AI Tasks

Hey there, fellow AI tinkerers. If you've ever heard someone say, "If your LLM model is used to generate text, you are not using it correctly," it might sound a bit extreme at first. But there's truth in it—large language models like Llama or Mistral aren't just chatty bots churning out stories or emails. At their core, they're powerful encoders of language understanding, and by slapping on custom "heads" (those output layers you attach to the frozen base model), you can repurpose them for all sorts of non-generative tasks. Think classification, embeddings, reward scoring, or even tool calling without the autoregressive hassle.

This isn't about ditching text gen entirely—it's about expanding your toolkit. Custom heads let you fine-tune just a tiny fraction of parameters (often <1% of the total) while leveraging the pre-trained magic of the LLM backbone. It's efficient, low-VRAM, and opens doors to real-world apps like toxicity detection or fact-checking. We'll break it down by usage, with examples, pseudo-code, and nods to deployed models. No fluff, just calm, practical insights to get you experimenting.

If your LLM model is used to generate text, you are not using it correctly illustration

Reward Modeling: Scoring Preferences Without the Drama

One of the coolest non-text uses is building reward models for RLHF or RLAIF. Instead of generating responses, your LLM judges them—outputting a scalar score for how "helpful" or "harmless" a completion is. This powers alignment in models like ChatGPT, but you can do it yourself with a simple linear head.

Take Starling-RM-7B-alpha from Berkeley: It's a Llama2-7B base with a linear head outputting a single scalar. Trained on GPT-4 preferences via Bradley-Terry loss, it scores prompt-response pairs higher for helpful, low-harm outputs. Real-world? Use it to filter toxic generations or rank candidates in Best-of-N sampling. VRAM hit? Negligible at <1MB for fp16 inference.

Pseudo-code to get started (using PyTorch and Hugging Face Transformers):

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        self.base = AutoModelForCausalLM.from_pretrained(base_model_name)
        hidden_size = self.base.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)  # Scalar output

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        # Pool to CLS token or mean
        pooled = outputs.last_hidden_state[:, 0]  # Assuming causal LM, use first token
        reward = self.reward_head(pooled).squeeze(-1)
        return reward  # Higher = better preference

# Usage example
model = RewardModel("meta-llama/Llama-2-7b-hf")
prompt_response = tokenizer("User: What's AI? Assistant: AI is...", return_tensors="pt")
scores = model(**prompt_response)
print(f"Helpfulness score: {scores.item()}")

Train on datasets like Anthropic's HH-RLHF: Pair prompts with chosen/rejected responses, minimize sigmoid loss on score differences. As per the RM-R1 paper (arXiv:2505.02387), adding reasoning traces boosts accuracy by 13.8% on RewardBench. Calm fact: This scales—DeepSeek-R1 uses similar for verifiable rewards without human labels. Check Hugging Face for Starling-RM: berkeley-nest/Starling-RM-7B-alpha.

Classification Heads: Quick Decisions on Sentiment, Toxicity, and More

Why generate a paragraph when a yes/no or category suffices? Bolt on a linear classification head for tasks like sentiment analysis, spam detection, or toxicity flagging. It's negligible overhead—8-40K params, <1MB VRAM—and deploys everywhere from content moderation to email filters.

Widely used in 2025: Models like ArmoRM-L1B for toxicity, trained on Jigsaw datasets with categories (toxic, obscene, threat). Input a comment, output logits for 2-10 classes. Pseudo-code:

import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base = base_model
        self.classifier = nn.Linear(base_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output if hasattr(outputs, 'pooler_output') else outputs.last_hidden_state.mean(dim=1)
        logits = self.classifier(pooled)
        return logits  # Softmax for probs

# Train with cross-entropy on labeled data like UCI SMS Spam

For toxicity, fine-tune on YouTube comments (Kaggle dataset) or Jigsaw's 200K+ annotations. The paper "Classification of Intent in Moderating Online Discussions" (ScienceDirect, 2024) shows LLMs with such heads hit 80%+ accuracy on multi-label toxicity, beating traditional ML. Calm tip: Start with frozen base, LoRA on the head for efficiency. Reference: Jigsaw Toxic Comment Classification.

Embeddings and Retrieval: Vector Magic for Search and Reranking

Embeddings turn text into dense vectors for similarity search—perfect for RAG without generation. Use an MLP head (8-20M params, 30-80MB VRAM) on the pooled output. Snowflake's Arctic-Embed-L-v2.0 (568M params) is a beast here: Multilingual, 1024D vectors via CLS pooling, optimized for retrieval on BEIR benchmarks.

Real use: Duplicate detection or reranking in search pipelines. Multi-head contrastive (Siamese) setups train two branches to pull similar pairs close, push dissimilar apart—60-150MB VRAM. The CoRe heads paper (arXiv:2510.02219) isolates <1% of attention heads for contrastive reranking, boosting BEIR by 20% with 40% less memory.

Pseudo-code for a basic embedding head:

class EmbeddingHead(nn.Module):
    def __init__(self, base_model, embed_dim=1024):
        super().__init__()
        self.base = base_model
        if base_model.config.hidden_size != embed_dim:
            self.proj = nn.Linear(base_model.config.hidden_size, embed_dim)
        else:
            self.proj = None

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
        embedding = self.proj(pooled) if self.proj else pooled
        return nn.functional.normalize(embedding, p=2, dim=1)  # L2 norm

# For contrastive: Use InfoNCE loss on pairs

Train on SNLI or MS MARCO for contrastive learning. Voyage-lite and BGE-small-v2 deploy this for sentence embeddings—check Snowflake Arctic Embed. For reranking, aggregate CoRe heads: Prune 50% layers, cut latency 20%.

Sequence Tagging and NER: Extracting Entities Without Spans

For PII redaction or slot filling, sequence tagging heads label each token (e.g., NER: PERSON, ORG). CRF or per-token linear (4096 x n_tags, <50MB VRAM) on hidden states. Private AI uses this for detecting names/emails in text/files via their ner/text endpoint.

Example: Snips dataset for slot filling ("Book a flight to Paris" → location: Paris). BiLSTM or BERT heads hit 96% accuracy. Pseudo-code:

class TaggingHead(nn.Module):
    def __init__(self, base_model, num_tags):
        super().__init__()
        self.base = base_model
        self.tagger = nn.Linear(base_model.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state
        logits = self.tagger(hidden)  # Shape: (batch, seq_len, num_tags)
        return logits  # CRF decode for best sequence

Fine-tune on CoNLL-2003 (Hugging Face datasets). The "Using Private AI as an NER Engine" guide details overlapping entity detection. Reference: CoNLL-2003 Dataset.

Span Extraction and QA: Pinpointing Answers

Extractive QA like SQuAD uses two linear heads for start/end logits (<10MB VRAM). Input context+question, output token spans. Fin-ExBERT (arXiv:2509.23259) adapts BERT for financial transcripts, scoring 4.93/5 on judges.

Pseudo-code:

class SpanHead(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        hidden = self.base.config.hidden_size
        self.start_logits = nn.Linear(hidden, 1)
        self.end_logits = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state
        start = self.start_logits(hidden).squeeze(-1)
        end = self.end_logits(hidden).squeeze(-1)
        return start, end  # Argmax for spans

Post-process: Sum logits, filter valid spans ≤30 tokens. Hits 88% F1 on SQuAD. See Hugging Face's QA chapter: Question Answering Tutorial.

Tool Calling and Verification: Agents Without Generation Loops

Tool-calling heads output parallel logits over n_tools (1-5MB VRAM) for single-pass function selection—faster than ReAct loops. DeepSeek-R1 supports this natively, calling weather APIs via JSON schemas.

Verification heads (8-20M params) for RAG fact-checking: Entailment logits (entail/contradict/neutral). Atlas-1B uses this to verify claims.

Pseudo-code for tool head:

class ToolHead(nn.Module):
    def __init__(self, base_model, num_tools):
        super().__init__()
        self.base = base_model
        self.tool_logits = nn.Linear(base_model.config.hidden_size, num_tools)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0]
        logits = self.tool_logits(pooled)
        return logits  # Top tool via argmax

For DeepSeek: Define schemas, set tool_choice="auto". vLLM enables this out-of-box: vLLM Tool Calling.

Uncertainty and Regression: Calibrating Confidence

Add a regression head (2 outputs, negligible VRAM) for confidence scores + uncertainty. FineCE (arXiv:2508.12040) integrates this during generation, detecting correct answers early (39.5% accuracy boost).

Pseudo-code:

class UncertaintyHead(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        self.reg_head = nn.Linear(base_model.config.hidden_size, 2)  # Mean + variance

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state.mean(dim=1)
        mu, logvar = self.reg_head(pooled).chunk(2, dim=-1)
        return mu, logvar.exp()  # Confidence as 1 / uncertainty

Train on GSM8K with Monte Carlo samples. Survey on UQ in LLMs (ACM, 2025) covers calibration methods.

MoE Heads: Multi-Task Superpowers

For ultra-multi-tasking, MoE heads (100-300M params, 400MB-1GB VRAM) route to 8+ experts. Gorilla-1B uses 100+ tool heads in MoE for function calling. OLMoE (1B active/7B total) deploys on edge devices.

Pseudo-code:

class MoEHead(nn.Module):
    def __init__(self, input_dim, num_experts=8, expert_dim=512):
        super().__init__()
        self.gate = nn.Linear(input_dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(input_dim, expert_dim) for _ in range(num_experts)])

    def forward(self, x):  # x from LLM pooled
        gates = nn.functional.softmax(self.gate(x), dim=-1)
        expert_outs = torch.stack([exp(x) for exp in self.experts], dim=-1)
        output = torch.sum(gates.unsqueeze(-2) * expert_outs, dim=-1)
        return output

Balances load, activates 2-4 experts. See OLMoE Paper.

Wrapping Up: Heads Up, Experiment Away

Custom heads transform LLMs from text spinners to versatile tools—classification for moderation, embeddings for search, rewards for alignment, all with minimal tweaks. Libraries like transformer-heads (Reddit, 2024) make attaching them a breeze. Dive into Hugging Face for bases, fine-tune with PEFT/LoRA, and test on benchmarks like RewardBench or BEIR. It's a calmer, more controlled way to wield AI power. What's your first head project?