A python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer” [TASLP 2024]


Audio-WestlakeU/SAR-SSL

SAR-SSL

A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.

  • Contributions
    • Self-supervised learning of spatial acoustic representation (SSL-SAR)

      • first self-supervised learning method for spatial acoustic representation learning and multi-channel audio signal processing
      • designs a cross-channel signal reconstruction pretext task to learn spatial acoustic and spectral pattern information
      • learns useful knowledge that can be transferred to spatial acoustics-related tasks
    • Multi-channel audio Conformer (MC-Conformer)

      • unified architecture for both the pretext and downstream tasks
      • learns the local and global properties of spatial acoustics present in the time-frequency domain
      • boosts the performance of both pretext and downstream tasks

Datasets

  • Source signals: from WSJ0 database
  • Simulated RIRs: generated by gpuRIR toolbox
  • Simulated noise: generated by arbitrary noise field generator
  • Real-world RIRs or microphone signals: from MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting, RealMAN databases
    | Datasets | #Room | Microphone Array | #Mic. Pair | #Room x #Source position x #Array position | Noise Type |
    |---|---|---|---|---|---|
    | MIR | 3 | Three 8-channel linear arrays | 60 | 3 x 26 x 1 | W/o |
    | MeshRIR | 1 | 441 microphones | 8874 | 1 x 32 x 1 | W/o |
    | DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
    | dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 x 3 x 1 | Ambience, babble, white |
    | BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
    | ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 x 1 x 2 | Ambience, babble, fan |
    | LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
    | MC-WSJ-AV | 3 | Two 8-channel linear arrays | | | |
    | LibriCSS | 1 | A 7-channel circular array | | | |
    | AMIMeeting | 3 | An 8-channel circular array | | | |
    | AISHELL-4 | 10 | An 8-channel circular array | | | |
    | AliMeeting | 21 | An 8-channel circular array | | | |
    | RealMAN | 32 | A 32-channel high-precision array | | | |

Quick start

Version update

  • code: 202407, the results are still being tested (to be updated).
  • code_v1: 202402, the results are the same as in the paper.

Data generation

1. Download the datasets into folders arranged according to the following directory structure

.-SAR-SSL
| .-code
| .-data
| .-exp
.-data
  .-SrcSig
  | .-wsj0
  |   .-dt
  |   .-et
  |   .-tr
  .-RIR
  | .-Mesh
  | | .-S32-M441_npy
  | .-MIRDB
  | | .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
  | .-DCASE
  | | .-TAU-SRIR_DB
  | | .-TAU-SNoise_DB
  | .-dEchorate
  | | .-dEchorate_database.csv
  | | .-dEchorate_rir.h5
  | | .-dEchorate_annotations.h5
  | | .-dEchorate_noise_gzip7.hdf5
  | | .-dEchorate_babble_gzip7.hdf5
  | | .-dEchorate_silence_gzip7.hdf5
  | .-BUTReverb
  | | .-RIRs
  | .-ACE
  |   .-RIRN
  |   .-Data
  .-MicSig
    .-LOCATA
      .-dev
      .-eval
    .-MC_WSJ_AV
    .-LibriCSS
    .-AMIMeeting
    .-AISHELL-4
    .-AliMeeting
    .-RealMAN
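Before running the generation scripts, it can help to confirm that the downloaded datasets match the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs`, the list `EXPECTED_DIRS`, and the `../data` location are assumptions made for illustration, and only folders that the tree spells out down to a sub-folder are checked.

```python
from pathlib import Path

# Sub-folders copied from the directory tree above, relative to the
# top-level data folder. Extend the list with the MicSig datasets you use.
EXPECTED_DIRS = [
    "SrcSig/wsj0/dt",
    "SrcSig/wsj0/et",
    "SrcSig/wsj0/tr",
    "RIR/Mesh/S32-M441_npy",
    "RIR/MIRDB/Impulse_response_Acoustic_Lab_Bar-Ilan_University",
    "RIR/DCASE/TAU-SRIR_DB",
    "RIR/DCASE/TAU-SNoise_DB",
    "RIR/dEchorate",
    "RIR/BUTReverb/RIRs",
    "RIR/ACE/RIRN",
    "RIR/ACE/Data",
    "MicSig/LOCATA/dev",
    "MicSig/LOCATA/eval",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "../data" is an assumed location; point it at your own data folder.
    for d in missing_dirs("../data"):
        print("missing:", d)
```

An empty result means every listed folder was found; anything printed is a path to download or rename before step 2.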

2. Generate room impulse responses or microphone signals

  • Data for simulated experiments

    • pre-training
      python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
      python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • some test instances
      python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
      
    • downstream training
      python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
      
  • Data for real-world experiments

    • real-world RIR and noise signals
      python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
      
    • microphone signals for pre-training with selected RIRs and noise signals
      python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real 
      python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real  
      
    • LOCATA microphone signals for downstream training (TDOA estimation)
      python gen_LOCATA.py --stage train --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage val --save-to ../../data/MicSig/real_ds_locata
      python gen_LOCATA.py --stage test --save-to ../../data/MicSig/real_ds_locata
      
    • additional RIRs for downstream training
      python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu 
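The scripts above turn source signals, RIRs, and noise into multi-channel microphone signals. As a rough illustration of that general idea (not the repository's `gen_simu.py` or `gen_sig_from_real_rir.py` implementation), the sketch below convolves a source with per-channel RIRs and adds noise at a chosen SNR. The function name `simulate_mic_signals`, the array shapes, and the random test data are assumptions; in practice the RIRs would come from gpuRIR or one of the real-world databases.

```python
import numpy as np
from scipy.signal import fftconvolve


def simulate_mic_signals(src, rirs, noise, snr_db):
    """Convolve a source with per-channel RIRs and add noise at snr_db.

    src:   (n_samples,) dry source signal
    rirs:  (n_rir_samples, n_channels) room impulse responses
    noise: (at least n_samples, n_channels) noise signals
    """
    n = src.shape[0]
    # Reverberant source image at each microphone, cut to the source length.
    clean = np.stack(
        [fftconvolve(src, rirs[:, ch])[:n] for ch in range(rirs.shape[1])],
        axis=1,
    )
    noise = noise[:n]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.standard_normal(16000)                   # stand-in for a WSJ0 utterance
    decay = np.exp(-np.arange(4000) / 800.0)[:, None]  # toy exponential decay
    rirs = rng.standard_normal((4000, 2)) * decay      # toy 2-channel RIRs
    noise = rng.standard_normal((16000, 2))
    mic = simulate_mic_signals(src, rirs, noise, snr_db=20.0)
    print(mic.shape)
```

The SNR here is computed over all channels jointly; a per-channel or reference-channel definition would be an equally reasonable choice.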
      

Pretext Task

1. Preparation

  • Install: numpy, scipy, soundfile, gpuRIR, etc.

2. Training

  • Simulated experiments

    • Pretext task: pre-training

      python run_pretrain.py --pretrain --simu-exp --gpu-id 0,
      
    • Pretext task: evaluation

      # * denotes the time stamp (version) of the pre-trained model
      # --test-mode: all or ins
      python run_pretrain.py --test --simu-exp --time * --test-mode all --gpu-id 0, 
      
    • Downstream task: training

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
      
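      | Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
      |---|---|---|---|---|---|
      | train | x16 | 2 | 50 | 2 | 200 |
      | train | x8 | 4 | 50 | 2 | 400 |
      | train | x4 | 8 | 50 | 2 | 800 |
      | train | x2 | 16 | 50 | 2 | 1600 |
      | train | x1 | 32 | 50 | 2 | 3200 |
      | train | x1 | 64 | 50 | 2 | 6400 |
      | train | x1 | 128 | 50 | 2 | 12800 |
      | train | x1 | 256 | 50 | 2 | 25600 |
      | val | - | 20 | 50 | 1 | 1000 |
      | test | - | 20 | 50 | 4 | 4000 |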
    • Downstream task: evaluation

      # --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
      # --ds-task: TDOA, DRR, T60, C50, or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --test_mode: cal_metric, cal_metric_wo_info or vis_embed
      python run_downstream.py --ds-test --test_mode cal_metric --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
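In the downstream data table above (Stage, Trials, nRooms, nRIRs/Room, nSrcSig/RIR, nMicSig), every nMicSig value equals the product of the three counts before it. The short sketch below reproduces that arithmetic. The relation is inferred from the table rows rather than stated by the authors, and the names `CONFIGS` and `n_mic_sig` are introduced here only for illustration.

```python
# (stage, trials, n_rooms, n_rirs_per_room, n_src_per_rir), copied from the table.
CONFIGS = [
    ("train", "x16", 2, 50, 2),
    ("train", "x8", 4, 50, 2),
    ("train", "x4", 8, 50, 2),
    ("train", "x2", 16, 50, 2),
    ("train", "x1", 32, 50, 2),
    ("train", "x1", 64, 50, 2),
    ("train", "x1", 128, 50, 2),
    ("train", "x1", 256, 50, 2),
    ("val", "-", 20, 50, 1),
    ("test", "-", 20, 50, 4),
]


def n_mic_sig(n_rooms, n_rirs_per_room, n_src_per_rir):
    """Number of microphone signals: rooms x RIRs per room x sources per RIR."""
    return n_rooms * n_rirs_per_room * n_src_per_rir


if __name__ == "__main__":
    for stage, trials, rooms, rirs, srcs in CONFIGS:
        print(stage, trials, rooms, n_mic_sig(rooms, rirs, srcs))
```

For example, `--ds-nsimroom 8` corresponds to 8 x 50 x 2 = 800 training signals.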
      
  • Real-world experiments

    • Pretext task: pre-training

      When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized to 0.001), then fine-tune on real-world data with a learning rate of 0.0001.

      python run_pretrain.py --pretrain --gpu-id 0, 
      
    • Downstream task: training

      # --ds-task: TDOA, DRR, T60, C50 or ABS
      # --ds-trainmode: finetune, scratchLOW or lineareval
      # --ds-real-sim-ratio: 1 1, 1 0 or 0 1
      python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
      
    • Downstream task: read downstream results (MAEs of TDOA, DRR, T60, C50, SNR, ABS estimation) from saved mat files

      python read_result_from_downstream_matfile.py --time *
      python read_lossmetric_simdata_from_logfile.py
      python read_lossmetric_realdata_from_logfile.py
      
  • Trained models

    • pretext task
      • best_model.tar
    • downstream task
      • ensemble_model.tar
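The real-world pre-training recipe above combines a cosine-decay learning rate starting at 0.001 on simulated data with a learning rate of 0.0001 on real-world data. The sketch below shows one common form of such a schedule in plain Python. It is an illustration under assumptions, not the schedule implemented in `run_pretrain.py`: the decay floor `lr_min=0.0`, the absence of warm-up, the constant rate in the second stage, and the names `cosine_decay_lr`, `two_stage_lr`, and `simu_steps` are all hypothetical.

```python
import math


def cosine_decay_lr(step, total_steps, lr_init=1e-3, lr_min=0.0):
    """Cosine-decay learning rate at `step` out of `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_min + (lr_init - lr_min) * cosine


def two_stage_lr(step, simu_steps):
    """Stage 1 (simulated data): cosine decay from 0.001.
    Stage 2 (real-world data): assumed constant 0.0001."""
    if step < simu_steps:
        return cosine_decay_lr(step, simu_steps)
    return 1e-4


if __name__ == "__main__":
    simu_steps = 1000  # hypothetical length of the simulated-data stage
    for step in (0, 250, 500, 750, 999, 1000, 1500):
        print(step, two_stage_lr(step, simu_steps))
```

If the second stage also uses cosine decay, the same helper can be reused with `lr_init=1e-4`.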

Others

If "OSError: [Errno 24] Too many open files" occurs, raise the open-file limit by running the following at the command line

ulimit -n 2048
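The same limit can also be raised from inside a Python process with the standard-library `resource` module (Unix only), which may be convenient when the shell cannot be changed. This is a sketch rather than something the repository provides: the helper name `raise_open_file_limit` is hypothetical, and the default of 2048 simply mirrors the `ulimit` value above.

```python
import resource


def raise_open_file_limit(target=2048):
    """Raise the soft RLIMIT_NOFILE to `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, leave it untouched
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft open-file limit:", raise_open_file_limit())
```

Call it once at the start of the script, before data-loader workers are created, so that child processes inherit the new limit.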

Citation

If you find our work useful in your research, please consider citing:

@Article{yang2024sarssl,
    Author = "Bing Yang and Xiaofei Li",
    Title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
    Journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)",
    Volume = "32",
    Number = "",
    Pages = "4211-4225",
    Year = "2024"}

Licence

MIT
