Training Database

The Training Database defines how MOOD MNKY agents are trained, evaluated, and improved over time. It provides a structured approach for maintaining training data, evaluation processes, and performance metrics to ensure agents deliver high-quality, consistent experiences.

Purpose and Role

The Training Database defines “how” agents learn and improve, ensuring continuous refinement of their capabilities and performance.
This database documents:
  • Training datasets and their characteristics
  • Evaluation methodologies and benchmarks
  • Performance metrics and improvement targets
  • Feedback collection and incorporation
  • Model version control and release management
  • Quality assurance processes and standards

Schema and Structure

Each training record follows this schema:
{
  "training_id": "string",
  "agent_id": "string",
  "dataset_id": "string",
  "model_version": "string",
  "training_type": "string",
  "parameters": "object",
  "metrics": "object",
  "status": "string",
  "created_at": "timestamp",
  "completed_at": "timestamp",
  "created_by": "string",
  "artifacts": "array",
  "notes": "string"
}

Field Descriptions

  • training_id: A unique identifier for the training record.
  • agent_id: The ID of the agent being trained.
  • dataset_id: The ID of the dataset used for training.
  • model_version: The base model version or identifier.
  • training_type: The type of training performed (fine-tuning, prompt engineering, etc.).
  • parameters: Training parameters and configuration.
  • metrics: Performance metrics and evaluation results.
  • status: Current status of the training (pending, in_progress, completed, failed).
  • created_at: When the training was initiated.
  • completed_at: When the training was completed.
  • created_by: The user or system that initiated the training.
  • artifacts: Paths to model files, logs, and other training artifacts.
  • notes: Additional information and observations about the training.
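
For illustration, a completed training record might look like the following. This is a minimal sketch; the specific IDs, parameter names, metric names, and paths are placeholders rather than values prescribed by the schema.

# Hypothetical example of a completed training record matching the schema above.
training_record = {
    "training_id": "trn_0a1b2c",
    "agent_id": "mood_mnky_001",
    "dataset_id": "ds_conversations_q2",
    "model_version": "v1.2.3",
    "training_type": "fine-tuning",
    "parameters": {"epochs": 3, "learning_rate_multiplier": 0.1},
    "metrics": {"task_completion_rate": 0.94, "helpfulness_rating": 4.6},
    "status": "completed",
    "created_at": "2024-05-10T14:32:00Z",
    "completed_at": "2024-05-10T18:05:00Z",
    "created_by": "ai_team_lead",
    "artifacts": ["models/mood_mnky_001/v1.3.0/", "logs/trn_0a1b2c.log"],
    "notes": "Improved recommendation phrasing for the relaxation product line."
}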

Training Methodologies

The MOOD MNKY agent system employs various approaches to training and improving agents:

Prompt Engineering

Systematic refinement of instructions, examples, and constraints to optimize agent behavior without modifying the underlying model.
  • System message optimization
  • Few-shot example curation
  • Response format structuring
  • Constraint definition
  • Chain-of-thought guidance
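
In practice, each prompt-engineering iteration can be tracked as a versioned configuration rather than a model change. The sketch below assumes a simple in-house representation; PromptConfig and its fields are hypothetical and not part of any SDK.

from dataclasses import dataclass, field

@dataclass
class PromptConfig:
    """Hypothetical container for a versioned prompt configuration."""
    system_message: str
    few_shot_examples: list = field(default_factory=list)  # curated (user, assistant) pairs
    response_format: str = "markdown"                       # expected output structure
    constraints: list = field(default_factory=list)         # hard rules the agent must follow

relaxation_v2 = PromptConfig(
    system_message=(
        "You are MOOD MNKY, a warm and knowledgeable product guide. "
        "Recommend at most two products per reply and always explain why."
    ),
    few_shot_examples=[
        ("I'm looking for something to help me relax.",
         "I'd suggest our Lavender Dreams candle; its lavender and chamomile notes..."),
    ],
    constraints=["Never make medical claims", "Stay within the published catalog"],
)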

Fine-tuning

Adjustment of model weights using carefully curated datasets to improve performance on specific tasks and align with brand voice.
  • Response quality improvements
  • Domain-specific knowledge
  • Brand tone and voice alignment
  • Specialized capability enhancement
  • Error reduction for common cases
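
A common preparation step is exporting curated conversations to the JSONL chat format that most fine-tuning pipelines expect. A minimal sketch, assuming dataset entries shaped like the conversation dataset shown later on this page:

import json

def to_finetune_jsonl(conversations: list, path: str, min_quality: float = 4.5) -> int:
    """Write high-quality conversations to a JSONL file in chat fine-tuning format.

    Assumes each entry has a "messages" list and a metadata "quality_rating",
    matching the conversation dataset structure used on this page.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for conv in conversations:
            if conv.get("metadata", {}).get("quality_rating", 0) < min_quality:
                continue  # only train on examples reviewers rated highly
            f.write(json.dumps({"messages": conv["messages"]}) + "\n")
            written += 1
    return written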

Retrieval Augmentation

Enhancement of agent capabilities by integrating external knowledge sources and dynamically retrieved context.
  • Knowledge base integration
  • Vector store implementation
  • Chunking and indexing strategies
  • Query formulation optimization
  • Context window management
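
Conceptually, retrieval augmentation indexes knowledge-base chunks as embeddings and injects the most relevant ones into the agent's context at query time. The sketch below covers only the chunking and similarity-ranking steps; the embedding model and vector store are outside its scope, and the helper names (chunk_text, cosine_similarity, retrieve) are illustrative.

import math

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split a document into overlapping character chunks for indexing."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_embedding: list, index: list, top_k: int = 3) -> list:
    """Return the top_k chunks most similar to the query.

    `index` is a list of (chunk, embedding) pairs produced by whichever
    embedding model the system uses.
    """
    ranked = sorted(index, key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]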

Behavioral Alignment

Techniques to ensure agent outputs align with desired behaviors, safety standards, and ethical guidelines.
  • Constitutional AI approaches
  • Reinforcement Learning from Human Feedback
  • Safety boundary implementation
  • Bias detection and mitigation
  • Response quality control
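
As one concrete illustration of the constitutional-style approach, an agent's draft reply can be checked against a short list of principles and revised before it is returned. This is a sketch only: the principles are examples, and it assumes the agent/runner usage shown in the SDK examples later on this page.

from agents import Agent, Runner

PRINCIPLES = [
    "Never make medical or therapeutic claims about products.",
    "Keep recommendations within the published catalog.",
    "Maintain a warm, non-judgmental tone.",
]

async def critique_and_revise(agent: Agent, user_input: str, draft: str) -> str:
    """Single constitutional-style pass: ask the agent to check its own draft
    against the principles and return a corrected reply if any are violated."""
    critique_prompt = (
        f"A user asked: {user_input}\n"
        f"Draft reply: {draft}\n"
        "Check the draft against these principles and return a revised reply if any "
        "are violated; otherwise return the draft unchanged:\n- " + "\n- ".join(PRINCIPLES)
    )
    result = await Runner.run(agent, critique_prompt)
    return result.final_output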

Training Datasets

The system maintains several types of datasets for agent training and evaluation:
  • Conversation Datasets
  • Task-specific Datasets
  • Evaluation Benchmarks

An example conversation dataset entry is shown below:

conversation_dataset = [
  {
    "id": "conv_78325a",
    "messages": [
      {"role": "user", "content": "I'm looking for something to help me relax in the evening."},
      {"role": "assistant", "content": "I'd be happy to recommend some relaxing products for your evening routine. Our Lavender Dreams candle and Sleep Well pillow mist are both popular choices for relaxation. The candle features lavender and chamomile notes, while the mist blends lavender with bergamot and cedarwood. Would you prefer something for your space, like the candle, or a more personal product like the pillow mist?"},
      {"role": "user", "content": "The pillow mist sounds nice. Does it work well for sensitive skin?"},
      {"role": "assistant", "content": "Yes, our Sleep Well pillow mist is formulated with sensitive skin in mind. It's designed to be sprayed on your pillow or linens rather than directly on the skin. The formula is free from parabens, phthalates, and artificial colors. All ingredients are clearly listed on our website, and many customers with sensitive skin have reported positive experiences. Would you like me to share the full ingredient list with you?"}
    ],
    "metadata": {
      "task": "product_recommendation",
      "quality_rating": 4.8,
      "source": "customer_service",
      "annotator": "product_specialist",
      "tags": ["sensitivity", "sleep", "recommendations"]
    }
  },
  # More conversation examples...
]
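
Before entries like the one above are used for training, they are typically checked for structural consistency. A minimal validation sketch (the required-metadata fields and rules here are illustrative, not a fixed standard):

REQUIRED_METADATA = {"task", "quality_rating", "source"}

def validate_conversation(conv: dict) -> list:
    """Return a list of issues found in a conversation dataset entry (empty = valid)."""
    issues = []
    messages = conv.get("messages", [])
    if not messages:
        issues.append("no messages")
    roles = [m.get("role") for m in messages]
    if roles and roles[0] != "user":
        issues.append("conversation does not start with a user turn")
    if any(a == b for a, b in zip(roles, roles[1:])):
        issues.append("roles do not alternate")
    missing = REQUIRED_METADATA - set(conv.get("metadata", {}))
    if missing:
        issues.append(f"missing metadata: {sorted(missing)}")
    return issues

# Map each entry ID to any problems found before admitting it to a training run.
problems = {c["id"]: validate_conversation(c) for c in conversation_dataset}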

Evaluation Framework

Core Metrics

Functional Accuracy

  • Task completion rate
  • Information accuracy
  • Procedural correctness
  • Error rate
  • Recovery capability

User Experience

  • Response relevance
  • Helpfulness rating
  • User satisfaction
  • Conversation flow
  • Clarity and conciseness

Brand Alignment

  • Tone consistency
  • Brand value reflection
  • Voice appropriateness
  • Messaging alignment
  • Visual harmony

Safety

  • Policy compliance
  • Boundary adherence
  • Refusal quality
  • Risk mitigation
  • Content appropriateness

Efficiency

  • Response time
  • Resolution speed
  • Turn efficiency
  • Resource utilization
  • Cost-effectiveness

Adaptability

  • Context handling
  • Ambiguity resolution
  • Error recovery
  • Flexibility
  • Learning application
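
These categories are often rolled up into a single comparable score per agent version. A minimal sketch of a weighted aggregate (the weights and normalized scores shown are illustrative, not a prescribed formula):

# Illustrative weights per metric category; tune these to business priorities.
CATEGORY_WEIGHTS = {
    "functional_accuracy": 0.30,
    "user_experience": 0.25,
    "brand_alignment": 0.15,
    "safety": 0.15,
    "efficiency": 0.10,
    "adaptability": 0.05,
}

def composite_score(category_scores: dict) -> float:
    """Combine per-category scores (each normalized to 0..1) into a weighted total."""
    return sum(CATEGORY_WEIGHTS[name] * category_scores.get(name, 0.0)
               for name in CATEGORY_WEIGHTS)

score = composite_score({
    "functional_accuracy": 0.92,
    "user_experience": 0.88,
    "brand_alignment": 0.95,
    "safety": 0.99,
    "efficiency": 0.81,
    "adaptability": 0.84,
})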

Evaluation Processes

from agents import Agent, Runner
from evaluation import EvaluationFramework

# Initialize evaluation framework
evaluator = EvaluationFramework()

async def evaluate_agent_version(agent_id: str, version: str, 
                                 benchmark_id: str) -> dict:
    """Evaluate a specific agent version against a benchmark."""
    # Load the agent (load_agent is assumed to be an internal helper, not an SDK call)
    agent = await load_agent(agent_id, version)
    
    # Load benchmark
    benchmark = evaluator.load_benchmark(benchmark_id)
    
    # Run evaluation
    results = await evaluator.run_benchmark(
        agent=agent,
        benchmark=benchmark,
        iterations=100,
        parallel=True
    )
    
    # Analyze results
    analysis = evaluator.analyze_results(results)
    
    # Store evaluation record
    evaluation_record = evaluator.store_evaluation(
        agent_id=agent_id,
        version=version,
        benchmark_id=benchmark_id,
        results=results,
        analysis=analysis
    )
    
    return evaluation_record

# Example evaluation
results = await evaluate_agent_version(
    agent_id="mood_mnky_001",
    version="v1.2.3",
    benchmark_id="benchmark_core_capabilities"
)

Integration with OpenAI Agents SDK

Agent Version Management

from agents import Agent, Runner
from training import AgentVersionManager

# Initialize version manager
version_manager = AgentVersionManager()

# Create new agent version
new_version = await version_manager.create_version(
    agent_id="mood_mnky_001",
    base_version="v1.2.3",
    changes=[
        {"type": "instruction_update", "content": "Improved product recommendation guidance"},
        {"type": "parameter_update", "name": "temperature", "value": 0.7},
        {"type": "tool_addition", "tool_id": "scent_analyzer"}
    ],
    metadata={
        "author": "ai_team_lead",
        "purpose": "Enhance recommendation quality",
        "ticket": "MNKY-452"
    }
)

# Load specific agent version
agent = await version_manager.load_version(
    agent_id="mood_mnky_001",
    version="v1.3.0"
)

# Run the agent
result = await Runner.run(agent, "I'm looking for a gift for someone who loves relaxing baths")

A/B Testing Implementation

from agents import Agent, Runner
from training import ABTestManager
import random

# Initialize A/B test manager
ab_test = ABTestManager()

# Create A/B test
test_id = await ab_test.create_test(
    name="product_recommendation_approach",
    variants={
        "control": "mood_mnky_001:v1.2.3",
        "variant_a": "mood_mnky_001:v1.3.0",
        "variant_b": "mood_mnky_001:v1.3.1"
    },
    allocation={
        "control": 0.4,
        "variant_a": 0.3,
        "variant_b": 0.3
    },
    metrics=[
        "conversion_rate",
        "satisfaction_score",
        "average_response_time"
    ],
    start_date="2024-05-15",
    end_date="2024-05-29"
)

async def get_ab_test_agent(user_id: str, test_id: str) -> Agent:
    """Get the appropriate agent variant for a user based on A/B test."""
    # Determine which variant to show this user
    variant = await ab_test.get_user_variant(user_id, test_id)
    
    # Record exposure to this variant
    await ab_test.record_exposure(test_id, user_id, variant)
    
    # Load the agent variant
    agent_id, version = variant.split(":")
    agent = await version_manager.load_version(agent_id, version)
    
    return agent

# Example usage
user_id = "usr_425b3c"
agent = await get_ab_test_agent(user_id, test_id)
result = await Runner.run(agent, "What would you recommend for stress relief?")

# Record conversion if applicable (flag set by the surrounding commerce flow)
if user_added_product_to_cart:
    await ab_test.record_conversion(test_id, user_id)
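
When the test window closes, variant performance is compared for statistical significance. A minimal sketch of a two-proportion z-test on conversion counts (pure arithmetic; the counts are placeholders, and how they are exported from ABTestManager is not specified here):

import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the z statistic comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se else 0.0

# Placeholder counts; in practice these come from recorded exposures and conversions.
z = two_proportion_z(conv_a=184, n_a=1200, conv_b=221, n_b=1230)
significant = abs(z) > 1.96  # ~95% confidence threshold for a two-sided test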

Training Infrastructure

The MOOD MNKY training system uses specialized infrastructure for different training approaches:
  • Prompt Engineering
  • Fine-tuning
  • Evaluation

The prompt engineering workflow, for example, is supported by a dedicated platform:

┌─────────────────────────┐
│   Prompt Lab Platform   │
├─────────────────────────┤
│ - Version control       │
│ - A/B testing framework │
│ - Performance analytics │
│ - Collaborative editing │
│ - Template management   │
└─────────────────────────┘

Continuous Improvement Process

The MOOD MNKY agent system follows a structured improvement cycle:
1. Data Collection: Gathering user interactions, feedback, and performance metrics from production environments to identify improvement opportunities.
2. Analysis & Prioritization: Analyzing collected data to identify patterns, issues, and high-impact improvement areas, then prioritizing them based on business impact.
3. Hypothesis Development: Formulating specific improvement hypotheses with expected outcomes and measurement approaches.
4. Implementation: Implementing improvements through prompt engineering, fine-tuning, or system modifications.
5. Evaluation: Testing improvements against benchmarks and through A/B testing to validate hypotheses.
6. Deployment: Rolling out validated improvements to production with appropriate monitoring.
7. Monitoring: Continuously tracking performance to ensure improvements maintain effectiveness over time.

Feedback Collection

The system collects various forms of feedback to guide training and improvement:

Explicit User Feedback

Direct feedback from users about their experience:
  • Ratings and reviews
  • Feature requests
  • Error reports
  • Satisfaction surveys
  • Support tickets

Implicit Behavioral Signals

Observed patterns in user behavior:
  • Conversation completion rates
  • Follow-up question frequency
  • Task success indicators
  • Engagement metrics
  • Repeat usage patterns

Human Evaluation

Expert assessment of agent performance:
  • Accuracy verification
  • Response quality scoring
  • Brand alignment review
  • Safety evaluation
  • Improvement suggestions

System Metrics

Technical performance indicators:
  • Response latency
  • Error rates
  • Token usage efficiency
  • Completion rates
  • API performance
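
To make these heterogeneous signals usable in training runs, they are typically normalized into a common record format. A minimal sketch (the FeedbackRecord fields and channel names are assumptions, not an established MOOD MNKY schema):

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FeedbackRecord:
    """Unified representation of a feedback signal, whatever its source."""
    agent_id: str
    channel: str                # "explicit", "implicit", "human_eval", or "system"
    signal: str                 # e.g. "rating", "task_success", "latency_ms"
    value: float                # normalized to a comparable scale where possible
    conversation_id: Optional[str] = None
    recorded_at: str = ""

def from_star_rating(agent_id: str, conversation_id: str, stars: int) -> FeedbackRecord:
    """Convert a 1-5 star rating into a normalized explicit-feedback record."""
    return FeedbackRecord(
        agent_id=agent_id,
        channel="explicit",
        signal="rating",
        value=(stars - 1) / 4,  # map 1..5 stars onto 0..1
        conversation_id=conversation_id,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )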

Best Practices for Agent Training

Data Quality

  • Representative Sampling: Ensure training data covers the full range of expected use cases
  • Balanced Coverage: Maintain appropriate distribution across different tasks and scenarios
  • Quality Control: Implement rigorous review processes for training data
  • Continuous Enrichment: Regularly update datasets with new examples and edge cases
  • Diversity Consideration: Include diverse perspectives and language patterns

Evaluation Design

  • Comprehensive Benchmarks: Create benchmarks that cover all critical capabilities
  • Real-world Alignment: Design evaluation scenarios that reflect actual usage
  • Objective Metrics: Define clear, measurable criteria for success
  • Human-in-the-loop: Combine automated evaluation with human assessment
  • Progressive Standards: Gradually increase quality thresholds as capabilities improve

Deployment Strategy

  • Staged Rollout: Use progressive deployment to limit risk
  • Rollback Readiness: Maintain capability to quickly revert to previous versions
  • Monitoring Plan: Define key metrics to watch during and after deployment
  • Feedback Mechanisms: Implement channels for collecting user feedback on changes
  • Documentation: Maintain clear records of changes and their expected effects

Continuous Improvement

  • Regular Review Cycles: Establish scheduled reviews of agent performance
  • Targeted Improvements: Focus on specific capabilities rather than general changes
  • Impact Measurement: Quantify the effect of each improvement
  • Learning Documentation: Maintain knowledge base of what works and what doesn’t
  • Cross-functional Input: Incorporate perspectives from multiple stakeholders

Additional Resources