EngineeringAI-LAB/3DXTalker

3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

3DXTalker generates identity-consistent, expressive 3D talking avatars from a single reference image and speech audio. It combines three ingredients: identity modeling on curated data, rich audio representations, and controllable spatial dynamics. Beyond standard speech embeddings, it introduces frame-wise amplitude and emotional cues, which improve lip synchronization and allow nuanced expression modulation. A flow-matching transformer generates natural head-pose motion and supports stylized control, so lip synchronization, emotional expression, and head-pose dynamics are handled within a single unified framework.

TODO

  • Release the 3DTalking benchmark dataset
    • Release the raw dataset
    • Release the processed dataset
  • Release the data processing code
  • Release the training and inference code
  • Release the pretrained models

Installation

  • Python 3.10
  • PyTorch 2.2.2
  • CUDA 12.1
  • PyTorch3D 0.7.7
conda create -n env_3DXTalker python==3.10
conda activate env_3DXTalker
pip install -r requirements.txt

On some systems, PyTorch3D fails to compile during the requirements install but builds correctly when installed afterwards. If that happens, run the following separately:

pip install "git+https://github.com/facebookresearch/pytorch3d.git@v0.7.7"
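
After installation, a quick check that the installed versions match the ones listed above can save debugging time later. The following is a minimal sketch, not part of the repository; it assumes the pip distribution names are `torch` and `pytorch3d`, and it only compares version strings (it does not verify the CUDA toolkit).

```python
from importlib import metadata

# Versions listed in the Installation section. The distribution names
# ("torch", "pytorch3d") are assumed to be the usual pip names.
EXPECTED = {"torch": "2.2.2", "pytorch3d": "0.7.7"}


def check_versions(expected):
    """Return {package: status}; status is 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # A local build tag such as "2.2.2+cu121" still counts as a match.
        base = found.split("+")[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `env_3DXTalker` environment; any line that does not end in `ok` points at the package to reinstall.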

Download Pretrained Audio Encoders

  1. Download the emotion2vec_plus_base model and place it in ./pretrained_models/:

    # Create directory
    mkdir -p pretrained_models/emotion2vec_plus_base
    
    # Option 1: Using git-lfs (recommended)
    cd pretrained_models
    git lfs install
    git clone https://huggingface.co/iic/emotion2vec_plus_base
    
    # Option 2: Manual download from https://huggingface.co/iic/emotion2vec_plus_base
    # Download all files to ./pretrained_models/emotion2vec_plus_base/
  2. Download microsoft/wavlm-base-plus (audio encoder):

    # Option 1: Auto-download on first run (recommended)
    # The model will be automatically downloaded from HuggingFace when you run training
    
    # Option 2: Pre-download manually
    cd pretrained_models
    git lfs install
    git clone https://huggingface.co/microsoft/wavlm-base-plus
    
    # Then update config/default_config.yaml:
    # audio_encoder_repo: './pretrained_models/wavlm-base-plus'

    Expected directory structure:

    pretrained_models/
    ├── emotion2vec_plus_base/
    │   ├── config.json
    │   ├── pytorch_model.bin
    │   └── ...
    └── wavlm-base-plus/          # Optional (auto-downloads if not present)
        ├── config.json
        ├── pytorch_model.bin
        └── ...
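
    To confirm the layout before training, a small check like the one below can help. It is a sketch rather than a repository script: it only looks for the two file names shown in the tree above and treats `wavlm-base-plus` as optional, as described in step 2.

```python
from pathlib import Path

# File names taken from the directory tree above; other files ("...") are not checked.
REQUIRED = {"emotion2vec_plus_base": ["config.json", "pytorch_model.bin"]}
OPTIONAL = {"wavlm-base-plus": ["config.json", "pytorch_model.bin"]}


def missing_files(root, layout):
    """Return relative paths (POSIX style) of expected files that are absent under root."""
    root = Path(root)
    return [
        (Path(model) / name).as_posix()
        for model, names in layout.items()
        for name in names
        if not (root / model / name).is_file()
    ]


if __name__ == "__main__":
    root = "pretrained_models"
    print("missing (required):", missing_files(root, REQUIRED))
    print("missing (optional, auto-downloads):", missing_files(root, OPTIONAL))
```

    An empty list for the required entry means the emotion2vec files are in place. Missing optional files are fine if you rely on the auto-download.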
    

Data Preparation and Preprocessing

  1. Download the raw video datasets from the following links: V0-GRID; V1-RAVDESS; V2-MEAD; V3-VoxCeleb2; V4-HDTF; V5-CelebV-HQ

    If you prefer not to process the data yourself, processed data is also available on Hugging Face.

  2. Run data curation (duration, noise, language, sync, resolution normalization).

  • Edit raw_video_dir in data_prepare/data_curation_pipeline.py to your raw video folder.
cd data_prepare
python data_curation_pipeline.py

Output will be in data_prepare/final_curated_videos/.
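
To illustrate the idea of staged curation, here is a simplified sketch of how clips can be passed through a sequence of filters, recording the first stage each rejected clip fails. This is not the code in `data_curation_pipeline.py`: the `Clip` fields, the filter names, and every threshold are hypothetical and chosen only for illustration. Resolution normalization is omitted because it transforms clips rather than filtering them.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    # Hypothetical per-clip metadata; the real pipeline may measure these differently.
    path: str
    duration_s: float
    snr_db: float
    language: str
    sync_confidence: float


# Illustrative thresholds only, not values taken from the repository.
FILTERS = [
    ("duration", lambda c: 2.0 <= c.duration_s <= 30.0),
    ("noise", lambda c: c.snr_db >= 15.0),
    ("language", lambda c: c.language == "en"),
    ("sync", lambda c: c.sync_confidence >= 3.0),
]


def curate(clips):
    """Split clips into kept ones and a {path: first_failed_stage} rejection log."""
    kept, rejected = [], {}
    for clip in clips:
        # Find the first filter the clip does not pass, if any.
        failed = next((name for name, passes in FILTERS if not passes(clip)), None)
        if failed is None:
            kept.append(clip)
        else:
            rejected[clip.path] = failed
    return kept, rejected
```

Keeping a per-stage rejection log makes it easy to see which criterion removes the most data from each source dataset.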

  3. Rename videos for dataset indexing.
  • Edit dataset_name, input_dir, and output_dir in data_prepare/rename.py if needed.
  • By default it expects input at data_prepare/Scaled_videos and outputs to data_prepare/Renamed_videos.
cd data_prepare
python rename.py
  4. Download EMOCA-related assets (models and FLAME files).
bash gdl_apps/EMOCA/demos/download_assets.sh
  5. Run EMOCA reconstruction to extract FLAME parameters.
  • Edit data_root_dir and dataset_name in gdl_apps/EMOCA/demos/my_recons_video.py.
  • data_root_dir should contain <dataset_name>/all_videos_path.txt.
python gdl_apps/EMOCA/demos/my_recons_video.py \
  --dataset_name VoxCeleb2 \
  --output_folder video_output \
  --model_name EMOCA_v2_lr_mse_20
  6. Data structures are provided in DATASET_STRUCTURE.md.
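
The EMOCA reconstruction step expects `<dataset_name>/all_videos_path.txt` under `data_root_dir`. If you need to build that file yourself, the sketch below shows one way to do it. The file format is an assumption (one absolute video path per line), as is the set of video extensions, so check `DATASET_STRUCTURE.md` and `my_recons_video.py` before relying on it.

```python
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".avi", ".mov"}  # assumed set of video extensions


def write_video_list(video_dir, data_root_dir, dataset_name):
    """Write <data_root_dir>/<dataset_name>/all_videos_path.txt and return its path.

    Assumed format: one absolute video path per line, sorted.
    """
    videos = sorted(
        p.resolve()
        for p in Path(video_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in VIDEO_SUFFIXES
    )
    out_file = Path(data_root_dir) / dataset_name / "all_videos_path.txt"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("".join(f"{v}\n" for v in videos), encoding="utf-8")
    return out_file


# Example call (paths are placeholders):
# write_video_list("data_prepare/Renamed_videos", "/path/to/data_root", "VoxCeleb2")
```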

Citation

@misc{wang20263dxtalkerunifyingidentitylip,
      title={3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars}, 
      author={Zhongju Wang and Zhenhong Sun and Beier Wang and Yifu Wang and Daoyi Dong and Huadong Mo and Hongdong Li},
      year={2026},
      eprint={2602.10516},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10516}, 
}

About

Official repository for 3DXTalker: An Integrated Framework for Expressive 3D Talking Avatars
