3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

3DXTalker generates identity-consistent, expressive 3D talking avatars from a single reference image and speech audio, achieving accurate lip synchronization, expressive emotion control, and natural head-pose dynamics. It achieves expressive facial animation through data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. By introducing frame-wise amplitude and emotional cues beyond standard speech embeddings, 3DXTalker delivers superior lip synchronization and nuanced expression modulation. Built on a flow-matching transformer architecture, it enables natural head-pose motion generation while supporting stylized control, integrating lip synchronization, emotional expression, and head-pose dynamics within a unified framework.

TODO

- Release the 3DTalking benchmark dataset
  - Release the raw dataset
  - Release the processed dataset
- Release the data processing code
- Release the training and inference code
- Release the pretrained models

Installation

Python 3.10
PyTorch 2.2.2
CUDA 12.1
PyTorch3D 0.7.7

conda create -n env_3DXTalker python=3.10
conda activate env_3DXTalker
pip install -r requirements.txt

If PyTorch3D fails to compile during the requirements install, try installing it separately afterwards:

pip install "git+https://github.com/facebookresearch/pytorch3d.git@v0.7.7"
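After installation, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of the repository: it assumes the pip distribution names are `torch` and `pytorch3d`, and it treats a local build tag such as `+cu121` as a match.

```python
from importlib import metadata

# Versions from the Installation list; the distribution names are assumptions.
EXPECTED = {"torch": "2.2.2", "pytorch3d": "0.7.7"}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Ignore a local build tag, e.g. "2.2.2+cu121" counts as "2.2.2".
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, want, ok)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} [{status}]")
```

Run it inside the activated `env_3DXTalker` environment. A `MISMATCH` on `pytorch3d` usually means the separate install step above is still needed.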

Download Pretrained Audio Encoders

Download the emotion2vec_plus_base model and place it in ./pretrained_models/:

# Create directory
mkdir -p pretrained_models/emotion2vec_plus_base

# Option 1: Using git-lfs (recommended)
cd pretrained_models
git lfs install
git clone https://huggingface.co/iic/emotion2vec_plus_base

# Option 2: Manual download from https://huggingface.co/iic/emotion2vec_plus_base
# Download all files to ./pretrained_models/emotion2vec_plus_base/

Download microsoft/wavlm-base-plus (audio encoder):

# Option 1: Auto-download on first run (recommended)
# The model will be automatically downloaded from HuggingFace when you run training

# Option 2: Pre-download manually
cd pretrained_models
git lfs install
git clone https://huggingface.co/microsoft/wavlm-base-plus

# Then update config/default_config.yaml:
# audio_encoder_repo: './pretrained_models/wavlm-base-plus'
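The choice between the two options can be illustrated with a small helper that prefers a local clone and otherwise falls back to the Hugging Face repository ID. This is a hypothetical sketch for illustration only: the function name `resolve_audio_encoder_repo` is not part of the repository, and the project itself reads the value from `audio_encoder_repo` in config/default_config.yaml.

```python
from pathlib import Path

# Hugging Face repository ID named in this section.
HUB_ID = "microsoft/wavlm-base-plus"


def resolve_audio_encoder_repo(local_dir="./pretrained_models/wavlm-base-plus",
                               hub_id=HUB_ID):
    """Return the local clone if it exists and is non-empty, else the Hub ID."""
    path = Path(local_dir)
    if path.is_dir() and any(path.iterdir()):
        return str(path)  # Option 2: pre-downloaded copy
    return hub_id         # Option 1: auto-download on first run


if __name__ == "__main__":
    # Print the value one would place in audio_encoder_repo.
    print(resolve_audio_encoder_repo())
```

An empty directory is treated as missing, so an interrupted `git clone` that left only the folder behind does not shadow the auto-download path.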

Expected directory structure:

pretrained_models/
├── emotion2vec_plus_base/
│   ├── config.json
│   ├── pytorch_model.bin
│   └── ...
└── wavlm-base-plus/          # Optional (auto-downloads if not present)
    ├── config.json
    ├── pytorch_model.bin
    └── ...
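To check a local setup against the tree above, a short script along these lines may be useful. It is a sketch under stated assumptions: it checks only the file names shown in the tree (the `...` entries are skipped), and it validates `wavlm-base-plus` only when that folder is present. If the downloaded repositories use different file names, adjust the lists.

```python
from pathlib import Path

# File names come from the tree above; entries shown as "..." are not checked.
REQUIRED = {"emotion2vec_plus_base": ["config.json", "pytorch_model.bin"]}
# wavlm-base-plus is optional because it can auto-download on first run.
OPTIONAL = {"wavlm-base-plus": ["config.json", "pytorch_model.bin"]}


def missing_files(root="pretrained_models"):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    missing = []
    for folder, names in REQUIRED.items():
        missing += [root / folder / n for n in names
                    if not (root / folder / n).is_file()]
    for folder, names in OPTIONAL.items():
        if (root / folder).is_dir():  # only validate a pre-downloaded copy
            missing += [root / folder / n for n in names
                        if not (root / folder / n).is_file()]
    return [str(p) for p in missing]


if __name__ == "__main__":
    problems = missing_files()
    if problems:
        print("Missing:\n" + "\n".join(problems))
    else:
        print("All expected files found.")
```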

Data Preparation and Preprocessing

Download the raw video datasets from the following sources: V0-GRID; V1-RAVDESS; V2-MEAD; V3-VoxCeleb2; V4-HDTF; V5-CelebV-HQ

If you prefer not to process the data yourself, processed data is also available on Hugging Face.

Run data curation (duration, noise, language, sync, resolution normalization).

Edit raw_video_dir in data_prepare/data_curation_pipeline.py to your raw video folder.

cd data_prepare
python data_curation_pipeline.py

Output will be in data_prepare/final_curated_videos/.
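A quick way to see how many clips survive curation is to count the video files in the raw folder and in the output folder. The helper below is a minimal sketch, not part of the pipeline; the set of extensions is an assumption, so adjust it to match your datasets.

```python
from collections import Counter
from pathlib import Path

# Assumed set of video extensions; adjust to match your datasets.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def count_videos(folder):
    """Count video files under `folder` (recursively), grouped by extension."""
    counts = Counter()
    folder = Path(folder)
    if not folder.is_dir():
        return counts  # a missing folder simply yields an empty count
    for path in folder.rglob("*"):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    curated = count_videos("data_prepare/final_curated_videos")
    print(f"curated videos: {sum(curated.values())} {dict(curated)}")
```

Calling `count_videos` on your `raw_video_dir` before the run and on `data_prepare/final_curated_videos/` afterwards gives a rough idea of how many clips the curation step filtered out.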

Rename videos for dataset indexing.

Edit dataset_name, input_dir, and output_dir in data_prepare/rename.py if needed.
By default, it reads from data_prepare/Scaled_videos and writes to data_prepare/Renamed_videos. If your curated videos are in data_prepare/final_curated_videos/ instead, set input_dir accordingly.

cd data_prepare
python rename.py
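For readers who want a feel for what index-based renaming involves, here is a hypothetical sketch. It does not reproduce the logic of rename.py: the naming pattern `<dataset_name>_<6-digit index>.mp4`, the function names, and the choice to copy rather than move are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def plan_renames(input_dir, output_dir, dataset_name, suffix=".mp4"):
    """Pair each source video with a hypothetical indexed target name."""
    sources = sorted(Path(input_dir).glob(f"*{suffix}"))
    return [
        (src, Path(output_dir) / f"{dataset_name}_{i:06d}{suffix}")
        for i, src in enumerate(sources)
    ]


def apply_renames(pairs):
    """Copy (not move) each file so the input folder stays untouched."""
    for src, dst in pairs:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


if __name__ == "__main__":
    pairs = plan_renames("data_prepare/Scaled_videos",
                         "data_prepare/Renamed_videos", "VoxCeleb2")
    for src, dst in pairs[:5]:
        print(f"{src} -> {dst}")  # preview before copying anything
```

Separating the plan from the copy step makes it possible to inspect the mapping first. Sorting the sources keeps the indices stable across runs as long as the input folder does not change.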

Download EMOCA-related assets (models and FLAME files).

bash gdl_apps/EMOCA/demos/download_assets.sh

Run EMOCA reconstruction to extract FLAME parameters.

Edit data_root_dir and dataset_name in gdl_apps/EMOCA/demos/my_recons_video.py.
data_root_dir should contain <dataset_name>/all_videos_path.txt.

python gdl_apps/EMOCA/demos/my_recons_video.py \
  --dataset_name VoxCeleb2 \
  --output_folder video_output \
  --model_name EMOCA_v2_lr_mse_20
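The reconstruction script expects `<dataset_name>/all_videos_path.txt` under `data_root_dir`, as noted above. If that file does not exist yet, something like the following sketch could generate it. The file format is an assumption (one absolute video path per line), as are the function name and the `.mp4` suffix, so check my_recons_video.py for the format it actually parses.

```python
from pathlib import Path


def write_video_list(data_root_dir, dataset_name, video_dir, suffix=".mp4"):
    """Write <data_root_dir>/<dataset_name>/all_videos_path.txt.

    Assumed format: one absolute video path per line, sorted.
    """
    videos = sorted(p.resolve() for p in Path(video_dir).rglob(f"*{suffix}"))
    list_file = Path(data_root_dir) / dataset_name / "all_videos_path.txt"
    list_file.parent.mkdir(parents=True, exist_ok=True)
    list_file.write_text("".join(f"{p}\n" for p in videos))
    return list_file, len(videos)


if __name__ == "__main__":
    # Placeholder paths; replace them with your own folders.
    path, count = write_video_list("data_root", "VoxCeleb2",
                                   "data_prepare/Renamed_videos")
    print(f"wrote {count} paths to {path}")
```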

The data structures are described in DATASET_STRUCTURE.md.

Citation

@misc{wang20263dxtalkerunifyingidentitylip,
      title={3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars}, 
      author={Zhongju Wang and Zhenhong Sun and Beier Wang and Yifu Wang and Daoyi Dong and Huadong Mo and Hongdong Li},
      year={2026},
      eprint={2602.10516},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10516}, 
}
