GenAI · November 20, 2024 · 6 min read

DSPy in Production: From Manual Prompts to Optimized Pipelines

How we improved Cohen's κ from 0.55 to 0.85 using DSPy for automated prompt optimization.

The Problem

At HTC Global Services, we built a GenAI call-center QA auditor that needed to score customer service calls on multiple dimensions: professionalism, problem resolution, compliance, and empathy.

The initial approach used hand-crafted prompts with GPT-4o. It worked, but inter-rater reliability (measured by Cohen's κ) hovered around 0.55-0.60—barely acceptable for production use.

Why DSPy?

Traditional prompt engineering is:

  • Brittle: Small prompt changes cause unpredictable output shifts
  • Non-transferable: Prompts optimized for one model rarely work for others
  • Hard to iterate: No systematic way to improve

DSPy treats prompts as learnable parameters, not fixed strings. Instead of writing prompts, you define:

  1. Signatures: Input/output specifications
  2. Modules: Composable building blocks
  3. Metrics: What "good" looks like

Then DSPy optimizes the prompts automatically.

Our Implementation

Defining the Signature

import dspy

class CallQualityAssessment(dspy.Signature):
    """Assess call quality based on transcript and scoring rubric."""

    transcript: str = dspy.InputField(desc="Full call transcript with timestamps")
    rubric: str = dspy.InputField(desc="Scoring criteria and examples")

    professionalism_score: int = dspy.OutputField(desc="1-5 score")
    professionalism_evidence: str = dspy.OutputField(desc="Specific quotes supporting score")
    resolution_score: int = dspy.OutputField(desc="1-5 score")
    resolution_evidence: str = dspy.OutputField(desc="Specific quotes supporting score")

The Key Insight: Evidence Enforcement

The breakthrough came from requiring timestamp evidence for every score. This wasn't just for explainability—it forced the model to ground assessments in specific transcript moments.

class QAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(CallQualityAssessment)

    def forward(self, transcript, rubric):
        result = self.assess(transcript=transcript, rubric=rubric)
        # Enforce that every evidence field cites at least one timestamp
        dspy.Assert(
            self._has_valid_timestamps(result.professionalism_evidence),
            "Evidence must include timestamps",
        )
        return result
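The timestamp check itself is a plain string test that runs outside the LLM call. A minimal sketch of one way to implement it, assuming evidence quotes timestamps in bracketed `[hh:mm:ss]` or `[mm:ss]` form (the regex and function name are illustrative, not the exact production code):

```python
import re

# Matches bracketed timestamps like [03:12] or [00:03:12] (assumed format)
TIMESTAMP_RE = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")

def has_valid_timestamps(evidence: str) -> bool:
    """Return True if the evidence string cites at least one timestamp."""
    return bool(TIMESTAMP_RE.search(evidence))
```

Because the check is deterministic, it doubles as a unit-testable contract on model output, independent of which LM sits behind the module.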

Optimization with Human Feedback

We used 50 human-scored calls as our training set:

from dspy.teleprompt import BootstrapFewShot
from sklearn.metrics import cohen_kappa_score

def metric(gold, pred, trace=None):
    # Cohen's kappa between human and model scores
    return cohen_kappa_score(gold.scores, pred.scores)

optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized_module = optimizer.compile(QAModule(), trainset=human_scored_calls)
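For intuition on the metric itself: Cohen's κ is observed agreement corrected for the agreement two raters would reach by chance, (p_o − p_e) / (1 − p_e). A self-contained sketch equivalent to the unweighted `cohen_kappa_score`, with toy scores for illustration:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b.get(k, 0) for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

human = [3, 4, 5, 2, 4, 3]
model = [3, 4, 4, 2, 4, 3]
print(round(cohen_kappa(human, model), 2))  # → 0.76
```

Note that κ is stricter than raw accuracy: the toy raters agree on 5 of 6 calls (≈0.83), but κ comes out lower because some of that agreement is expected by chance.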

Results

After DSPy optimization:

| Metric | Before | After |
|--------|--------|-------|
| Cohen's κ (professionalism) | 0.58 | 0.82 |
| Cohen's κ (resolution) | 0.55 | 0.85 |
| Cohen's κ (compliance) | 0.62 | 0.88 |
| Avg. assessment time | 45s | 12s |

Business Impact

  • $100K/year QA cost savings: Reduced manual review by 70%
  • Scaled to 10K+ calls/month: Previously only 2K were reviewed
  • ~70% more root causes surfaced: Systematic coverage caught patterns human spot-checking missed

Lessons for Production DSPy

  1. Start with clear metrics: DSPy needs a numeric signal to optimize against
  2. Assertions are powerful: Use them to enforce output structure
  3. Evidence requirements improve reliability: Grounding reduces hallucination
  4. Human-in-the-loop matters: Our best results came from continuous feedback integration
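Lessons 2 and 3 combine into a guardrail that can run entirely outside the model: reject any assessment whose scores are out of range or whose evidence fails to cite the transcript. A hypothetical sketch (the field names mirror the signature above, but the dict shape and checks are illustrative):

```python
import re

TIMESTAMP_RE = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")

def validate_assessment(assessment: dict) -> list:
    """Return a list of validation errors; an empty list means the assessment passes."""
    errors = []
    for dim in ("professionalism", "resolution"):
        score = assessment.get(f"{dim}_score")
        evidence = assessment.get(f"{dim}_evidence", "")
        # Lesson 2: enforce output structure (typed, in-range scores)
        if not isinstance(score, int) or not 1 <= score <= 5:
            errors.append(f"{dim}_score must be an integer from 1 to 5")
        # Lesson 3: enforce grounding (evidence must cite a timestamp)
        if not TIMESTAMP_RE.search(evidence):
            errors.append(f"{dim}_evidence must cite a bracketed timestamp")
    return errors
```

Failed validations can be retried, escalated to a human reviewer, or fed back as negative examples, which is where lesson 4's continuous feedback loop starts.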

What's Next

We're exploring DSPy's newer optimizers (MIPROv2) and multi-stage pipelines for more complex assessments. The framework's ability to treat prompts as learnable parameters has fundamentally changed how we approach LLM applications.