# @happyvertical/smrt-voice

TTS voice profile management with AI voice design, audio cloning from samples, and word-level timing for lip-sync.

v0.20.44 · Voice Design · Cloning · Lip-Sync Timings

## Overview
smrt-voice manages voice profiles for AI-powered text-to-speech synthesis. Profiles can be created via AI voice design (from a natural language prompt) or by cloning from audio samples. Generated TTS output includes word-level timing data for lip-sync integration.
## Installation

```bash
npm install @happyvertical/smrt-voice
```

## Quick Start
```typescript
import { VoiceProfile, VoiceSample, VoiceOutput } from '@happyvertical/smrt-voice';

// Mode 1: Voice design -- AI generates from prompt
const designed = new VoiceProfile({
  name: 'News Anchor',
  language: 'en-US',
  gender: 'male',
  designPrompt: 'Warm, authoritative male voice with clear enunciation',
  defaultSpeed: 1.0, // 0.5 - 2.0
  defaultPitch: 0, // -20 to 20 semitones
});
await designed.save();

// Mode 2: Voice cloning -- replicate from audio sample(s)
const cloned = new VoiceProfile({
  name: 'Custom Voice',
  language: 'en-US',
  sampleAssetId: 'asset-123',
});
await cloned.save();

// Add training samples (minimum 3 seconds, quality != low)
const sample = new VoiceSample({
  voiceProfileId: cloned.id,
  assetId: 'asset-456',
  duration: 5.2,
  transcription: 'Hello, this is a test recording for voice cloning.',
  quality: 'high',
  sampleRate: 48000,
  format: 'wav',
  isPrimary: true,
});
await sample.save();

// TTS output with word-level timing for lip-sync
const output = new VoiceOutput({
  voiceProfileId: designed.id,
  sourceText: 'Welcome to the evening news.',
  audioAssetId: 'asset-789',
  duration: 2.8,
  wordTimings: [
    { word: 'Welcome', start: 0.0, end: 0.4 },
    { word: 'to', start: 0.4, end: 0.5 },
    { word: 'the', start: 0.5, end: 0.6 },
    { word: 'evening', start: 0.6, end: 1.0 },
    { word: 'news', start: 1.0, end: 1.3 },
  ],
});

// Look up which word is spoken at a timestamp
output.getWordAtTime(0.7); // { word: 'evening', start: 0.6, end: 1.0 }
```

## Core Models
### VoiceProfile

```typescript
class VoiceProfile extends SmrtObject {
  name: string;
  language: string;
  gender: 'male' | 'female' | 'neutral';
  designPrompt?: string;           // AI voice design (mutually exclusive with sampleAssetId)
  sampleAssetId?: string;          // Cloned from audio (mutually exclusive with designPrompt)
  defaultSpeed: number;            // 0.5 - 2.0
  defaultPitch: number;            // -20 to 20 semitones
  voiceData?: Record<string, any>; // Provider-specific (opaque)
  status: 'pending' | 'processing' | 'ready' | 'failed';

  get isCloned(): boolean;
  get isDesigned(): boolean;
  get isReady(): boolean;
}
```
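The derived getters can be understood as simple checks on the fields above. The sketch below is a standalone illustration of that logic, not the package's actual implementation; the field and getter names follow the reference, but the function bodies are assumptions:

```typescript
// Standalone sketch of VoiceProfile's derived getters (assumed logic).
type ProfileStatus = 'pending' | 'processing' | 'ready' | 'failed';

interface ProfileFields {
  designPrompt?: string;
  sampleAssetId?: string;
  status: ProfileStatus;
}

// A profile is "designed" when it carries a design prompt...
function isDesigned(p: ProfileFields): boolean {
  return p.designPrompt !== undefined;
}

// ...and "cloned" when it references a source audio asset.
function isCloned(p: ProfileFields): boolean {
  return p.sampleAssetId !== undefined;
}

// Ready for TTS only once provisioning has completed.
function isReady(p: ProfileFields): boolean {
  return p.status === 'ready';
}

const profile: ProfileFields = { designPrompt: 'Warm male voice', status: 'ready' };
console.log(isDesigned(profile)); // true
console.log(isCloned(profile));   // false
console.log(isReady(profile));    // true
```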
### VoiceSample

```typescript
class VoiceSample extends SmrtObject {
  voiceProfileId: string;
  assetId: string;
  duration: number;              // Seconds
  transcription?: string;
  quality: 'low' | 'medium' | 'high';
  sampleRate?: number;
  format?: string;
  isPrimary: boolean;

  get meetsMinDuration(): boolean;     // >= 3 seconds
  get isSuitableForCloning(): boolean; // >= 3 sec AND quality != low
}
```
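The two suitability getters encode the documented rules (at least 3 seconds of audio, quality above `low`). A minimal standalone sketch of those rules, assuming the semantics described above:

```typescript
// Sketch of VoiceSample's documented suitability rules (not the real getters).
type Quality = 'low' | 'medium' | 'high';

interface SampleFields {
  duration: number; // seconds
  quality: Quality;
}

const MIN_CLONING_DURATION = 3; // seconds, per the docs above

function meetsMinDuration(s: SampleFields): boolean {
  return s.duration >= MIN_CLONING_DURATION;
}

// Suitable only when long enough AND not low quality.
function isSuitableForCloning(s: SampleFields): boolean {
  return meetsMinDuration(s) && s.quality !== 'low';
}

console.log(isSuitableForCloning({ duration: 5.2, quality: 'high' })); // true
console.log(isSuitableForCloning({ duration: 2.0, quality: 'high' })); // false: too short
console.log(isSuitableForCloning({ duration: 5.2, quality: 'low' }));  // false: low quality
```

Remember that these rules are documented rather than enforced at construction time, so checking `isSuitableForCloning` before cloning is the caller's responsibility.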
### VoiceOutput (extends Content)

```typescript
class VoiceOutput extends Content {
  voiceProfileId: string;
  sourceText: string;
  audioAssetId: string;
  duration: number;
  wordTimings: WordTiming[];           // [{ word, start, end }] in seconds
  audioMetadata?: VoiceOutputMetadata;

  get wordCount(): number;
  get wordsPerSecond(): number;
  getWordAtTime(seconds: number): WordTiming | undefined;
}
```

## Best Practices
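For lip-sync, `getWordAtTime` resolves a playback timestamp to the word being spoken. The standalone sketch below shows one plausible lookup over sorted, non-overlapping timings; the interval semantics (start-inclusive, end-exclusive) are an assumption, not a guarantee of the real method:

```typescript
// Standalone sketch of a timestamp-to-word lookup (assumed semantics).
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function getWordAtTime(timings: WordTiming[], seconds: number): WordTiming | undefined {
  // Linear scan; fine for sentence-length outputs. Assumes [start, end).
  return timings.find((t) => seconds >= t.start && seconds < t.end);
}

const timings: WordTiming[] = [
  { word: 'Welcome', start: 0.0, end: 0.4 },
  { word: 'evening', start: 0.6, end: 1.0 },
];

console.log(getWordAtTime(timings, 0.7)?.word); // 'evening'
console.log(getWordAtTime(timings, 0.5));       // undefined (silence between words)
```

Note the `undefined` case: timings need not cover the whole clip, so a lip-sync loop should treat a miss as a closed mouth rather than an error.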
### DOs

- Use `designPrompt` XOR `sampleAssetId` (mutually exclusive modes)
- Check `isSuitableForCloning` before using samples (3+ sec, not low quality)
- Use `getWordAtTime()` for precise lip-sync alignment
- Check `isReady` before using a profile for TTS generation
- Set `tenantId: null` for global/default voice profiles
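Since the XOR rule is a convention rather than a constructor check, a small pre-save guard in application code can catch violations early. This helper is hypothetical (smrt-voice does not ship it):

```typescript
// Hypothetical guard for the designPrompt-XOR-sampleAssetId convention.
interface NewProfileFields {
  designPrompt?: string;
  sampleAssetId?: string;
}

function validateCreationMode(p: NewProfileFields): void {
  const designed = p.designPrompt !== undefined;
  const cloned = p.sampleAssetId !== undefined;
  // Reject both-set (ambiguous mode) and neither-set (no mode at all).
  if (designed === cloned) {
    throw new Error('Set exactly one of designPrompt or sampleAssetId');
  }
}

validateCreationMode({ designPrompt: 'Calm narrator voice' }); // ok
try {
  validateCreationMode({ designPrompt: 'x', sampleAssetId: 'asset-1' });
} catch (e) {
  console.log((e as Error).message); // both set -> rejected
}
```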
### DON'Ts

- Don't set both `designPrompt` and `sampleAssetId` on the same profile
- Don't expect the framework to generate `wordTimings` (they are populated by the external TTS provider)
- Don't rely on the 3-second minimum being enforced in the constructor (documented only)
- Don't assume status transitions are enforced (manual status setting is possible)
- Don't depend on a specific `voiceData` schema (provider-specific, opaque)
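Because `wordTimings` come from the external TTS provider, the application typically converts whatever timing format the provider returns into the `{ word, start, end }` shape in seconds. The payload shape below (`text`, `startMs`, `endMs`) is an invented example of a millisecond-based provider response, not any specific provider's API:

```typescript
// Hypothetical provider payload using millisecond offsets.
interface ProviderWordMs {
  text: string;
  startMs: number;
  endMs: number;
}

interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Convert provider milliseconds to the seconds-based WordTiming shape.
function toWordTimings(words: ProviderWordMs[]): WordTiming[] {
  return words.map((w) => ({
    word: w.text,
    start: w.startMs / 1000,
    end: w.endMs / 1000,
  }));
}

const timings = toWordTimings([
  { text: 'Welcome', startMs: 0, endMs: 400 },
  { text: 'to', startMs: 400, endMs: 500 },
]);
console.log(timings[0]); // { word: 'Welcome', start: 0, end: 0.4 }
```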