Projects

Selected engineering work across real-time voice agents, speech AI, production audio ML systems, GPU workers, backend services, deployed RAG project sites, and shipped Flutter products. The Health Voice RAG card opens its project site; the live demo is active for now, and the site keeps recorded sessions and engineering notes for when the demo is offline.

AI / RAG / Voice Assistant

Live demo active for now

Health Voice RAG Demo

A complete project website for a real-time health voice assistant. The live voice/text RAG demo is currently active, and the site remains useful without it through architecture notes, pipeline visibility, citations, recorded session replay, retrieval evaluation, reliability notes, and production-oriented engineering details.

LiveKit, Python, Next.js, Vercel, Azure OpenAI, Google GenAI, Hybrid Retrieval, LanceDB, ONNX Runtime

Voice AI / Real-time Systems

NetEase Youdao · Deprecated / currently down

Real Time Voice Assistant Platform

Implemented server and worker components for real-time voice chat, translation, token generation, and assistant orchestration. Customized LiveKit integrations and unified multiple speech and LLM providers behind one real-time workflow.

LiveKit, Python, Flask, Azure Speech, Speech APIs, LLM services

Audio ML / Flutter Web / Backend

NetEase Youdao · Deprecated / currently down

Source Separation Website

Built a Flutter Web audio separation website with four separation options: vocal/accompaniment separation, advanced instrument splitting, denoising, and cleaner training-data preparation. Also built the backend, with FastAPI middleware, WebSocket communication, GPU workers, and object-storage delivery.

Flutter Web, FastAPI, WebSocket, Redis, GPU inference, Object Storage, Audio Separation

Audio ML / Backend Infrastructure

NetEase Youdao

Production Voice Conversion and Separation Stack

Internal production system; beneficiary details are not public.

Developed service layers for singing voice conversion, zero-shot conversion from reference audio, voice cloning, source separation, loudness normalization, audio enhancement, GPU workers, and object storage uploads. Built PostgreSQL-backed task and job tracking that covered worker registration, job statuses, and job acquisition flows. Also maintained deployment workflows, API contracts, and operational tooling for production audio services.

Python, PyTorch, FastAPI, Redis, PostgreSQL, RabbitMQ, FFmpeg, Zero-shot SVC, Source Separation, Loudness Normalization, Audio Enhancement

Speech AI / Data Pipelines

NetEase Youdao

Dubbing and Prompt TTS Data Pipeline

Designed and deployed a full dubbing and prompt-TTS backend that coordinated media ingestion, audio separation, subtitle extraction and correction, LLM translation, zero-shot TTS, audio alignment, enhancement, final mixing, and object-storage delivery. The system used FastAPI for task APIs, PostgreSQL-backed async workers for long-running GPU and audio jobs, and internal Gradio tools for operations and testing rather than a public interface.

See more about the dubbing pipeline

The backend accepted social media URLs, uploaded files, and video-plus-SRT inputs through FastAPI. Each request became a PostgreSQL task with target languages, enabled pipeline flags, status, step outputs, failure reason, retry count, lease timing, and per-step timing metadata. Workers acquired jobs directly from Postgres using leased task rows, moved them through explicit statuses, refreshed leases during long GPU/API calls, and saved progress after every step.
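The leased-row pattern described above can be sketched roughly as follows. This is a minimal illustration, not the production code: it uses Python's built-in sqlite3 as a stand-in so it runs anywhere, and the table layout, column names, `LEASE_SECONDS`, and function names are all hypothetical. On PostgreSQL, the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED` so that concurrent workers skip rows another worker is already claiming.

```python
import sqlite3

LEASE_SECONDS = 300  # hypothetical lease length


def make_db() -> sqlite3.Connection:
    # isolation_level=None enables autocommit so transactions are managed explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        """CREATE TABLE tasks (
               id INTEGER PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               worker_id TEXT,
               lease_expires_at REAL,
               retry_count INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def acquire_task(conn, worker_id: str, now: float):
    """Claim the oldest pending task, or a running task whose lease expired."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            """SELECT id, status FROM tasks
               WHERE status = 'pending'
                  OR (status = 'running' AND lease_expires_at < ?)
               ORDER BY id LIMIT 1""",
            (now,),
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        task_id, status = row
        # Reclaiming an expired lease counts as a retry.
        retry_increment = 1 if status == "running" else 0
        conn.execute(
            """UPDATE tasks
               SET status = 'running', worker_id = ?, lease_expires_at = ?,
                   retry_count = retry_count + ?
               WHERE id = ?""",
            (worker_id, now + LEASE_SECONDS, retry_increment, task_id),
        )
        conn.execute("COMMIT")
        return task_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


def refresh_lease(conn, task_id: int, worker_id: str, now: float) -> bool:
    """Extend the lease; returns False if this worker no longer owns the task."""
    cur = conn.execute(
        """UPDATE tasks SET lease_expires_at = ?
           WHERE id = ? AND worker_id = ? AND status = 'running'""",
        (now + LEASE_SECONDS, task_id, worker_id),
    )
    return cur.rowcount == 1
```

A worker would call `refresh_lease` periodically during a long GPU or API call; if it crashes, the lease simply expires and another worker can reclaim the row.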

The pipeline could download social media audio/video, fetch existing subtitles, or fall back to SRT extraction. For subtitle generation and repair, it integrated an ASR-based SRT extractor plus an SRT corrector that used the vocal audio to adjust subtitle timing, merge/split short segments, respect speaker similarity, and better align text with real speech. This covered diarization, segmentation, timestamp correction, and audio-aware subtitle cleanup before translation or TTS.

For translation, I built an LLM subtitle translator that chunked SRT files, supported many target languages, retried failed chunks, split difficult chunks into smaller pieces, preserved subtitle order, and validated model output. The translator checked line counts, timestamp structure, forbidden meta prefixes, no-dub markers, and target-language Unicode/codepoint ranges for languages like Japanese, Korean, and Thai so wrong-script or malformed outputs could be caught before downstream TTS.
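As a rough illustration of the output checks described above, the sketch below validates one translated chunk for line count, chatty prefixes, and target script. The function names, the prefix pattern, and the 0.5 threshold are assumptions made for the example, not the values used in the real translator; the code-point ranges are the standard Unicode blocks for each script.

```python
import re

# Unicode block ranges used for a rough script check (inclusive code points).
SCRIPT_RANGES = {
    "ja": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF)],  # hiragana, katakana, CJK ideographs
    "ko": [(0xAC00, 0xD7AF), (0x1100, 0x11FF)],  # Hangul syllables, Hangul Jamo
    "th": [(0x0E00, 0x0E7F)],  # Thai
}

# Hypothetical examples of meta prefixes a model might prepend.
META_PREFIX = re.compile(r"^\s*(translation|note|here is)\b", re.IGNORECASE)


def script_ratio(text: str, lang: str) -> float:
    """Fraction of alphabetic characters that fall in the target script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 1.0  # digits or punctuation only: nothing to judge
    ranges = SCRIPT_RANGES[lang]
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in letters)
    return hits / len(letters)


def validate_chunk(source_lines, translated_lines, lang, min_ratio=0.5):
    """Return a list of problems; an empty list means the chunk passed."""
    if len(source_lines) != len(translated_lines):
        return [f"line count {len(translated_lines)} != {len(source_lines)}"]
    problems = []
    for i, line in enumerate(translated_lines):
        if not line.strip():
            problems.append(f"line {i}: empty")
        elif META_PREFIX.match(line):
            problems.append(f"line {i}: meta prefix")
        elif lang in SCRIPT_RANGES and script_ratio(line, lang) < min_ratio:
            problems.append(f"line {i}: wrong script for {lang}")
    return problems
```

A non-empty result would be the signal to retry the chunk or split it into smaller pieces before anything reaches TTS.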

The TTS pipeline generated zero-shot speech from reference vocal audio. It sliced reference audio per subtitle, dispatched line-level or batch TTS jobs to deployed TTS services, reconstructed a full timeline from generated lines, aligned output to subtitle slots, handled reruns for edited lines, uploaded line artifacts, and then ran post-TTS processing. Finalization mixed generated speech with accompaniment from the separation service, optionally muxed audio back into video, converted final audio, uploaded final SRT/audio/video outputs to object storage, and cached reusable artifacts by video, language, separation service, and duration limits.
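The text above does not spell out how generated lines were fitted to subtitle slots, so the following is only one plausible strategy, sketched under stated assumptions: each clip starts at its slot (or right after the previous clip if that one ran long) and is sped up, within a cap, when it would overflow. `MAX_SPEEDUP`, `Placement`, and `fit_to_slots` are hypothetical names for illustration.

```python
from dataclasses import dataclass

MAX_SPEEDUP = 1.3  # hypothetical cap so sped-up speech stays natural


@dataclass
class Placement:
    start: float     # where the clip begins on the final timeline (seconds)
    speed: float     # playback-rate factor to apply (1.0 = unchanged)
    duration: float  # clip length after the speed change (seconds)


def fit_to_slots(slots, clip_durations):
    """slots: list of (start, end) in seconds; clip_durations: generated line lengths."""
    placements = []
    cursor = 0.0  # end of the previously placed clip
    for (slot_start, slot_end), dur in zip(slots, clip_durations):
        start = max(slot_start, cursor)  # never overlap the previous line
        available = max(slot_end - start, 1e-6)
        # Only speed up, never slow down, and never beyond the cap.
        speed = min(max(dur / available, 1.0), MAX_SPEEDUP)
        fitted = dur / speed
        placements.append(Placement(start, speed, fitted))
        cursor = start + fitted
    return placements
```

With this approach a clip that still overflows at the cap pushes the next line later rather than being cut off, which is a design choice, not something the description above confirms.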

FastAPI, PostgreSQL, Async Workers, Gradio, LLM Translation, SRT Correction, ASR, Diarization, Segmentation, Zero-shot TTS, Audio Separation, Object Storage

Consumer Audio Apps / GPU Inference

NetEase Youdao

SingUp GPU Voice Backend and Consumer Audio Releases

Built GPU training and inference support for SingUp, covering custom voice training, AI cover generation, model download, and result callbacks. Implemented Redis and RabbitMQ worker pipelines for audio download, vocal cleaning, separation, RVC training, generated audio upload, status tracking, and multi-GPU deployment.

Related apps: SingUp, Clear AI Audio Filter, Easy Chord, AI Remix

Python, PyTorch, FastAPI, Redis, RabbitMQ, GPU inference, Voice training, Object Storage

Flutter / Game / Real-Time Multiplayer / On-Device AI

Igramo Jamb

Built and shipped Igramo Jamb, a cross-platform Flutter implementation of a Serbian/Balkan dice scorecard game related to Yahtzee. The app supports offline local play, real-time online rooms, spectator mode, ELO ranking, player profiles, match history, English/Serbian localization, and ONNX-powered AI bot opponents trained with reinforcement learning. The released JAMB AI repository documents the training, evaluation, ONNX export, and selected bot models.

Product and AI details

The product layer combines Flutter Web and Flutter iOS with Firebase Authentication, Firestore, Realtime Database, Storage, Analytics, account flows, avatar uploads, game history, rankings, presence, inactivity handling, and compatibility-aware online sync.

The bot system was trained separately with custom Gymnasium environments and Maskable PPO, then exported from PyTorch to ONNX. The public AI subsystem release includes the v4-v6 environment lineage, pinned dependencies, experiment documentation, model card, Flutter integration notes, and selected v6 ONNX/SB3 artifacts.
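The core idea behind action masking, as used by Maskable PPO, is that illegal actions are removed from the policy's distribution before sampling. The sketch below shows that idea in plain Python; it is not the training code, and `legal_scorecard_actions` is a hypothetical mask that assumes a scorecard cell can be chosen only while it is still empty.

```python
import math


def masked_softmax(logits, legal_mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(legal_mask):
        raise ValueError("at least one action must be legal")
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def legal_scorecard_actions(filled):
    """Hypothetical mask: a cell is a legal target only if it is not yet filled."""
    return [not is_filled for is_filled in filled]
```

For example, `masked_softmax(logits, legal_scorecard_actions(filled))` yields a distribution in which already-filled cells can never be sampled, so the agent spends no training signal on learning the rules of legality.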

Flutter Web, Flutter iOS, Firebase Authentication, Firestore, Realtime Database, Firebase Storage, Firebase Analytics, Online Multiplayer, ONNX Runtime, Game AI, Reinforcement Learning

Speech AI / TTS Fine-Tuning

CPU demo

ZipVoice Multilingual TTS Fine-Tuning

Fine-tuned ZipVoice for multilingual text-to-speech generation in Serbian and Arabic, with a complete training and inference pipeline, Hugging Face model deployment, and an interactive CPU-only Gradio demo for generating speech from text. The Serbian model used CLARIN speech datasets, while the Arabic model used Common Voice, ArVoice, and MGB2 Arabic data. Evaluation covered WER, CER, audio duration, and WavLM similarity; Serbian reported mean WER 0.17 and CER 0.10, while Arabic reported mean WER 0.14 and CER 0.05.
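For readers unfamiliar with the metrics, WER and CER are both edit distance divided by reference length, computed over words and characters respectively. The sketch below is a minimal reference implementation, not the evaluation script used for the reported numbers; it assumes whitespace tokenization and ignores spaces for CER, and real evaluations usually also normalize case and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with spaces removed from both strings."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

On this scale, a mean WER of 0.17 corresponds to roughly 17 word errors per 100 reference words.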

Model links

Serbian model: https://huggingface.co/karim1993/zipvoice-sr-finetuned

Arabic model: https://huggingface.co/karim1993/zipvoice-ar-finetuned

Demo space: https://huggingface.co/spaces/karim1993/zipvoice-multilingual-tts-demo

ZipVoice, PyTorch, Hugging Face, Gradio, ONNX-compatible Inference, Serbian TTS, Arabic TTS, WER/CER Evaluation, WavLM Similarity