DSPy in Production: From Manual Prompts to Optimized Pipelines
How we improved Cohen's κ from 0.55 to 0.85 using DSPy for automated prompt optimization.
The Problem
At HTC Global Services, we built a GenAI call-center QA auditor that needed to score customer service calls on multiple dimensions: professionalism, problem resolution, compliance, and empathy.
The initial approach used hand-crafted prompts with GPT-4o. It worked, but inter-rater reliability against human auditors (measured by Cohen's κ) hovered around 0.55-0.60, barely acceptable for production use.
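For context, Cohen's κ measures how much two raters agree beyond what chance alone would produce: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement implied by each rater's label distribution. A minimal stdlib sketch (the rating data below is illustrative, not from our production calls):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    # Observed agreement: fraction of items where the two raters match.
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected matches if raters scored independently,
    # computed from each rater's label marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

human = [3, 4, 4, 2, 5, 3, 4, 2]
model = [3, 4, 3, 2, 5, 3, 4, 3]
print(round(cohen_kappa(human, model), 2))  # → 0.66
```

Note that κ is deliberately harsher than raw accuracy: the two raters above agree on 6 of 8 calls (75%), yet κ is only 0.66 because some of that agreement is expected by chance.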
Why DSPy?
Traditional prompt engineering is:
- Brittle: Small prompt changes cause unpredictable output shifts
- Non-transferable: Prompts optimized for one model rarely work for others
- Hard to iterate: No systematic way to improve
DSPy treats prompts as learnable parameters, not fixed strings. Instead of writing prompts, you define:
- Signatures: Input/output specifications
- Modules: Composable building blocks
- Metrics: What "good" looks like
Then DSPy optimizes the prompts automatically.
Our Implementation
Defining the Signature
```python
class CallQualityAssessment(dspy.Signature):
    """Assess call quality based on transcript and scoring rubric."""

    transcript: str = dspy.InputField(desc="Full call transcript with timestamps")
    rubric: str = dspy.InputField(desc="Scoring criteria and examples")

    professionalism_score: int = dspy.OutputField(desc="1-5 score")
    professionalism_evidence: str = dspy.OutputField(desc="Specific quotes supporting score")
    resolution_score: int = dspy.OutputField(desc="1-5 score")
    resolution_evidence: str = dspy.OutputField(desc="Specific quotes supporting score")
```
The Key Insight: Evidence Enforcement
The breakthrough came from requiring timestamp evidence for every score. This wasn't just for explainability: it forced the model to ground each assessment in specific transcript moments.
```python
import re

class QAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(CallQualityAssessment)

    def forward(self, transcript, rubric):
        result = self.assess(transcript=transcript, rubric=rubric)
        # Validate that the evidence cites at least one timestamp;
        # a failed dspy.Assert triggers DSPy's backtracking and retry.
        dspy.Assert(
            self._has_valid_timestamps(result.professionalism_evidence),
            "Evidence must include timestamps",
        )
        return result

    @staticmethod
    def _has_valid_timestamps(evidence: str) -> bool:
        # Accept [MM:SS]- or HH:MM:SS-style markers in the quoted evidence.
        return bool(re.search(r"\d{1,2}:\d{2}", evidence))
```
Optimization with Human Feedback
We used 50 human-scored calls as our training set:
```python
from dspy.teleprompt import BootstrapFewShot
from sklearn.metrics import cohen_kappa_score

def metric(gold, pred, trace=None):
    # Cohen's kappa between human and model scores
    return cohen_kappa_score(gold.scores, pred.scores)

optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized_module = optimizer.compile(QAModule(), trainset=human_scored_calls)
```
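With only a handful of dimensions per call, a per-example κ is statistically noisy, so a simpler per-example signal such as the exact-match rate across dimensions is another option. A hedged sketch (the `DIMENSIONS` tuple and `exact_match_rate` name are illustrative, and `SimpleNamespace` stands in for DSPy example/prediction objects):

```python
from types import SimpleNamespace

# Score fields from the signature that the metric compares (illustrative subset).
DIMENSIONS = ("professionalism_score", "resolution_score")

def exact_match_rate(gold, pred, trace=None):
    # Fraction of dimensions where the model's score equals the human's.
    matches = sum(getattr(gold, d) == getattr(pred, d) for d in DIMENSIONS)
    return matches / len(DIMENSIONS)

gold = SimpleNamespace(professionalism_score=4, resolution_score=3)
pred = SimpleNamespace(professionalism_score=4, resolution_score=2)
print(exact_match_rate(gold, pred))  # → 0.5
```

Any function with this `(gold, pred, trace=None)` shape that returns a number can be handed to a DSPy optimizer as the metric.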
Results
After DSPy optimization:
| Metric | Before | After |
|--------|--------|-------|
| Cohen's κ (professionalism) | 0.58 | 0.82 |
| Cohen's κ (resolution) | 0.55 | 0.85 |
| Cohen's κ (compliance) | 0.62 | 0.88 |
| Avg. assessment time | 45s | 12s |
Business Impact
- $100K/year QA cost savings: Reduced manual review by 70%
- Scaled to 10K+ calls/month: Previously only 2K were reviewed
- ~70% more root causes surfaced: Systematic coverage caught patterns human spot-checking missed
Lessons for Production DSPy
- Start with clear metrics: DSPy needs a numeric signal to optimize against
- Assertions are powerful: Use them to enforce output structure
- Evidence requirements improve reliability: Grounding reduces hallucination
- Human-in-the-loop matters: Our best results came from continuous feedback integration
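The "assertions are powerful" lesson boils down to a validate-and-retry loop: check the output against a hard constraint and, on failure, regenerate with the error message as feedback. A framework-agnostic sketch of the pattern (the helper names are hypothetical, not the DSPy API):

```python
def assert_and_retry(generate, validate, max_retries=2):
    """Call `generate`; when `validate` rejects the output, retry with feedback."""
    feedback = None
    for _ in range(max_retries + 1):
        output = generate(feedback)
        ok, message = validate(output)
        if ok:
            return output
        feedback = message  # surfaced to the model on the next attempt
    raise ValueError(f"output failed validation after retries: {message}")

def fake_model(feedback):
    # Stand-in for an LLM call: the first draft omits timestamps,
    # the retry (prompted with the feedback) includes one.
    return "Agent was polite." if feedback is None else '[03:12] "happy to help"'

def check(output):
    return ("[" in output, "Evidence must include timestamps")

print(assert_and_retry(fake_model, check))  # → [03:12] "happy to help"
```

DSPy's assertion machinery adds on top of this bare loop the bookkeeping of injecting the failure message into the prompt automatically, which is why enforcing structure there is so cheap.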
What's Next
We're exploring DSPy's newer optimizers (MIPROv2) and multi-stage pipelines for more complex assessments. The framework's ability to treat prompts as learnable parameters has fundamentally changed how we approach LLM applications.