Question in the inference

The README says the spectrogram must be in [N,C,W] format:

spectrogram = # get your hands on a spectrogram in [N,C,W] format

Could you please explain what these three dimensions are?
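For reference, here is a minimal sketch of how I currently read the format. It assumes that N is the batch size, C is the number of mel channels, and W is the number of spectrogram frames. That reading is my own guess and is not confirmed by the DiffWave documentation. The helper name `to_ncw` is hypothetical, and the zero array is only a placeholder with the same shape as the spectrogram I report further down (80 channels, 314 frames).

```python
import numpy as np


def to_ncw(spec: np.ndarray) -> np.ndarray:
    """Return `spec` as a 3-D array laid out as [N, C, W].

    Assumption (not confirmed by the DiffWave docs): N = batch size,
    C = mel channels, W = frames. A 2-D [C, W] input gets a leading
    batch axis of size 1; a 3-D input is returned unchanged.
    """
    if spec.ndim == 2:
        spec = spec[np.newaxis, :, :]
    if spec.ndim != 3:
        raise ValueError(f"expected a 2-D or 3-D array, got {spec.ndim}-D")
    return spec


# Placeholder data with the same shape as my synthesizer output.
spec = np.zeros((80, 314), dtype=np.float32)
batched = to_ncw(spec)
print(batched.shape)  # (1, 80, 314)
```

If that reading is right, my (80, 314) array would only be missing the batch axis, so the layout alone may not explain the noise.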

I use the code from https://github.com/CorentinJ/Real-Time-Voice-Cloning to produce the mel spectrogram and DiffWave as the vocoder, but the generated audio is nothing but noise.

Generating the mel spectrogram

specs = synthesizer.synthesize_spectrograms(texts, embeds)  # len(specs) == 1
spec = specs[0]  # numpy.ndarray, float32, shape (80, 314)
spec = torch.tensor(spec)

Generating the waveform

diffwave_dir = "/hdd/haoran_project/diffwave-master/pretrained_models/diffwave-ljspeech-22kHz-1000578.pt"
generated_wav, sample_rate = diffwave_predict(spec, diffwave_dir, fast_sampling=True)

Saving it to disk

filename = "results/diffwave_Elon.wav"
print(generated_wav.dtype, generated_wav.shape)  # torch.float32 torch.Size([1, 87040])
torchaudio.save(filename, generated_wav.cpu(), sample_rate=sample_rate)
