# @happyvertical/smrt-voice

TTS voice profile management with AI voice design, audio cloning from samples, and word-level timing for lip-sync.

v0.20.44 · Voice Design · Cloning · Lip-Sync Timings

## Overview

smrt-voice manages voice profiles for AI-powered text-to-speech synthesis. Profiles can be created via AI voice design (from a natural language prompt) or by cloning from audio samples. Generated TTS output includes word-level timing data for lip-sync integration.

## Installation

```bash
npm install @happyvertical/smrt-voice
```

## Quick Start

```typescript
import { VoiceProfile, VoiceSample, VoiceOutput } from '@happyvertical/smrt-voice';

// Mode 1: Voice design -- AI generates from prompt
const designed = new VoiceProfile({
  name: 'News Anchor',
  language: 'en-US',
  gender: 'male',
  designPrompt: 'Warm, authoritative male voice with clear enunciation',
  defaultSpeed: 1.0,   // 0.5 - 2.0
  defaultPitch: 0,     // -20 to 20 semitones
});
await designed.save();

// Mode 2: Voice cloning -- replicate from audio sample(s)
const cloned = new VoiceProfile({
  name: 'Custom Voice',
  language: 'en-US',
  sampleAssetId: 'asset-123',
});
await cloned.save();

// Add training samples (minimum 3 seconds, quality != low)
const sample = new VoiceSample({
  voiceProfileId: cloned.id,
  assetId: 'asset-456',
  duration: 5.2,
  transcription: 'Hello, this is a test recording for voice cloning.',
  quality: 'high',
  sampleRate: 48000,
  format: 'wav',
  isPrimary: true,
});
await sample.save();

// TTS output with word-level timing for lip-sync
const output = new VoiceOutput({
  voiceProfileId: designed.id,
  sourceText: 'Welcome to the evening news.',
  audioAssetId: 'asset-789',
  duration: 2.8,
  wordTimings: [
    { word: 'Welcome', start: 0.0, end: 0.4 },
    { word: 'to', start: 0.4, end: 0.5 },
    { word: 'the', start: 0.5, end: 0.6 },
    { word: 'evening', start: 0.6, end: 1.0 },
    { word: 'news', start: 1.0, end: 1.3 },
  ],
});
// Look up which word is spoken at a timestamp
output.getWordAtTime(0.7); // { word: 'evening', start: 0.6, end: 1.0 }
```
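The lookup shown above can be reproduced with a plain interval scan over the timings array. This is an illustrative standalone sketch, not the library's code; the library's behavior at exact word boundaries may differ.

```typescript
// Reproduce a word-at-timestamp lookup with a linear scan. WordTiming
// mirrors the { word, start, end } shape above; intervals are treated as
// half-open [start, end) so adjacent words never both match at a boundary.
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function wordAtTime(
  timings: WordTiming[],
  seconds: number,
): WordTiming | undefined {
  return timings.find((t) => seconds >= t.start && seconds < t.end);
}

const timings: WordTiming[] = [
  { word: 'Welcome', start: 0.0, end: 0.4 },
  { word: 'to', start: 0.4, end: 0.5 },
  { word: 'the', start: 0.5, end: 0.6 },
  { word: 'evening', start: 0.6, end: 1.0 },
  { word: 'news', start: 1.0, end: 1.3 },
];

wordAtTime(timings, 0.7); // matches 'evening' (0.6 - 1.0)
```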

## Core Models

### VoiceProfile

```typescript
class VoiceProfile extends SmrtObject {
  name: string
  language: string
  gender: 'male' | 'female' | 'neutral'
  designPrompt?: string       // AI voice design (mutually exclusive with sampleAssetId)
  sampleAssetId?: string      // Cloned from audio (mutually exclusive with designPrompt)
  defaultSpeed: number        // 0.5 - 2.0
  defaultPitch: number        // -20 to 20 semitones
  voiceData?: Record<string, any>  // Provider-specific (opaque)
  status: 'pending' | 'processing' | 'ready' | 'failed'

  get isCloned(): boolean
  get isDesigned(): boolean
  get isReady(): boolean
}
```
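Going by the field documentation, the getters are presumably thin checks over the creation-mode fields and the status. A standalone sketch of that implied logic (the real class extends SmrtObject; this is illustrative only):

```typescript
// Sketch of the VoiceProfile getters implied by the field docs.
// ProfileShape is a hypothetical stand-in for the relevant fields.
type VoiceStatus = 'pending' | 'processing' | 'ready' | 'failed';

interface ProfileShape {
  designPrompt?: string;
  sampleAssetId?: string;
  status: VoiceStatus;
}

const isDesigned = (p: ProfileShape): boolean => p.designPrompt !== undefined;
const isCloned = (p: ProfileShape): boolean => p.sampleAssetId !== undefined;
const isReady = (p: ProfileShape): boolean => p.status === 'ready';
```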

### VoiceSample

```typescript
class VoiceSample extends SmrtObject {
  voiceProfileId: string
  assetId: string
  duration: number            // Seconds
  transcription?: string
  quality: 'low' | 'medium' | 'high'
  sampleRate?: number
  format?: string
  isPrimary: boolean

  get meetsMinDuration(): boolean      // >= 3 seconds
  get isSuitableForCloning(): boolean  // >= 3 sec AND quality != low
}
```
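The two gates documented above (a 3-second minimum and a quality floor above 'low') can be expressed as simple predicates. Field names mirror VoiceSample, but this is an illustrative sketch, not the package's implementation:

```typescript
// Sketch of the documented sample gates for voice cloning.
type SampleQuality = 'low' | 'medium' | 'high';

interface SampleShape {
  duration: number; // seconds
  quality: SampleQuality;
}

const MIN_SAMPLE_SECONDS = 3;

const meetsMinDuration = (s: SampleShape): boolean =>
  s.duration >= MIN_SAMPLE_SECONDS;

const isSuitableForCloning = (s: SampleShape): boolean =>
  meetsMinDuration(s) && s.quality !== 'low';
```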

### VoiceOutput (extends Content)

```typescript
class VoiceOutput extends Content {
  voiceProfileId: string
  sourceText: string
  audioAssetId: string
  duration: number
  wordTimings: WordTiming[]   // [{ word, start, end }] in seconds
  audioMetadata?: VoiceOutputMetadata

  get wordCount(): number
  get wordsPerSecond(): number
  getWordAtTime(seconds: number): WordTiming | undefined
}
```
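For long outputs, a linear scan per `getWordAtTime` lookup is O(n). Since TTS word timings are normally sorted by start and non-overlapping, a binary search keeps repeated lookups (e.g. per animation frame) logarithmic. An illustrative sketch under those assumptions, not the library's implementation:

```typescript
// Binary search for the timing interval covering a timestamp, assuming
// timings are sorted by start and non-overlapping. Intervals are treated
// as half-open [start, end).
interface WordTiming {
  word: string;
  start: number;
  end: number;
}

function wordAtTimeBinary(
  timings: WordTiming[],
  seconds: number,
): WordTiming | undefined {
  let lo = 0;
  let hi = timings.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const t = timings[mid];
    if (seconds < t.start) {
      hi = mid - 1;
    } else if (seconds >= t.end) {
      lo = mid + 1;
    } else {
      return t; // start <= seconds < end
    }
  }
  return undefined; // before the first word, after the last, or in a gap
}
```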

## Best Practices

### DOs

- Use designPrompt XOR sampleAssetId (mutually exclusive modes)
- Check isSuitableForCloning before using samples (3+ sec, not low quality)
- Use getWordAtTime() for precise lip-sync alignment
- Check isReady before using a profile for TTS generation
- Set tenantId: null for global/default voice profiles

### DON'Ts

- Don't set both designPrompt and sampleAssetId on the same profile
- Don't expect the framework to generate wordTimings (populated by external TTS provider)
- Don't rely on the 3-second minimum being enforced in the constructor (documented only)
- Don't assume status transitions are enforced (manual status setting is possible)
- Don't depend on a specific voiceData schema (provider-specific, opaque)
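Since neither the mutual exclusion nor the status transitions are enforced by the framework, it can be worth adding a small pre-flight guard of your own before TTS generation. `assertUsableForTts` below is a hypothetical helper, not part of the package:

```typescript
// Hypothetical guard enforcing the documented rules: exactly one of
// designPrompt / sampleAssetId, and status 'ready' before generating TTS.
interface ProfileCheck {
  designPrompt?: string;
  sampleAssetId?: string;
  status: 'pending' | 'processing' | 'ready' | 'failed';
}

function assertUsableForTts(p: ProfileCheck): void {
  const modes = [p.designPrompt, p.sampleAssetId].filter(
    (v) => v !== undefined,
  ).length;
  if (modes !== 1) {
    throw new Error('Set exactly one of designPrompt or sampleAssetId');
  }
  if (p.status !== 'ready') {
    throw new Error(`Profile not ready (status: ${p.status})`);
  }
}
```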

## Related Modules