CROSSSPEECH: SPEAKER-INDEPENDENT ACOUSTIC REPRESENTATION FOR CROSS-LINGUAL SPEECH SYNTHESIS

Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim

ABSTRACT

While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap stems mainly from the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech, which improves the quality of cross-lingual speech by effectively disentangling speaker and language information at the level of the acoustic feature space. Specifically, CrossSpeech decomposes the speech generation pipeline into a speaker-independent generator (SIG) and a speaker-dependent generator (SDG). The SIG produces a speaker-independent acoustic representation that is not biased toward specific speaker distributions. The SDG, in turn, models the speaker-dependent speech variation that characterizes speaker attributes. By handling each type of information separately, CrossSpeech obtains disentangled speaker and linguistic representations. Our experiments verify that CrossSpeech achieves significant improvements in cross-lingual TTS, especially in terms of speaker similarity to the target speaker.
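
The two-stage decomposition described in the abstract can be illustrated with a toy sketch. This is not the CrossSpeech implementation: the function names, the dimensions, the random weights, and in particular the additive way the speaker information is combined with the SIG output are all assumptions made for illustration. The only point shown is the data flow, in which the first stage never sees the speaker and the second stage adds speaker-specific variation on top.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_MELS = 16, 8  # hypothetical toy sizes

# Random stand-ins for trained weights (assumption, illustration only).
W_text = rng.normal(size=(HIDDEN, HIDDEN))
W_mel = rng.normal(size=(HIDDEN, N_MELS))
W_spk = rng.normal(size=(HIDDEN, N_MELS))


def speaker_independent_generator(text_emb, lang_emb):
    """Toy SIG: uses text and language information only, never the speaker."""
    hidden = np.tanh((text_emb + lang_emb) @ W_text)
    return hidden @ W_mel  # shape: (frames, N_MELS)


def speaker_dependent_generator(si_repr, spk_emb):
    """Toy SDG: adds a speaker-conditioned offset to the SIG output.

    The additive combination is an assumption of this sketch.
    """
    return si_repr + spk_emb @ W_spk  # offset broadcasts over frames


# One utterance (5 frames), one language, two hypothetical speakers.
text_emb = rng.normal(size=(5, HIDDEN))
lang_emb = rng.normal(size=(HIDDEN,))
spk_a = rng.normal(size=(HIDDEN,))
spk_b = rng.normal(size=(HIDDEN,))

si_repr = speaker_independent_generator(text_emb, lang_emb)
out_a = speaker_dependent_generator(si_repr, spk_a)
out_b = speaker_dependent_generator(si_repr, spk_b)
```

In this sketch `si_repr` is computed once and shared by both speakers, while `out_a` and `out_b` differ only through the speaker-conditioned term. This mirrors the three items listed for each script in the disentanglement samples: a speaker-independent representation, a speaker-dependent representation, and the final output.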


DISENTANGLEMENT




We synthesized speech from English text with speakers of each language (i.e., English, Chinese, and Korean).


English speakers



Script: In being comparatively modern.

Speaker-independent representation

Speaker-dependent representation

The final output




Script: Has never been surpassed.

Speaker-independent representation

Speaker-dependent representation

The final output




AUDIO SAMPLES

Cross-lingual

English text + Chinese speaker

Script: That depends on the type of medical examination, but basically you have to go to a hospital.

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


English text + Korean speaker

Script: But you will never be diagnosed with a lifestyle related disease at a clinic.

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Chinese text + English speaker

Script: 下午三点钟,“考生”们各就各位后,接听正式开始。

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Chinese text + Korean speaker

Script: 我就跑啊,往山东跑,往新疆跑。

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Korean text + English speaker

Script: 더불어 아일랜드 사회의 변화의 흐름이 이어졌다.

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Korean text + Chinese speaker

Script: 그를 둘러싼 흥미진진한 베일이 곧 걷혀질 참이다.

Target speaker

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech



Intra-lingual

English text + English speaker

Script: Fourteen sixty-nine, fourteen seventy.

Ground truth

Vocoded

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Chinese text + Chinese speaker

Script: 几周后,贝蒂确诊患有乳腺癌。

Ground truth

Vocoded

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech


Korean text + Korean speaker

Script: 씻어 건져낸다.

Ground truth

Vocoded

FastPitch

Y. Zhang et al.

D. Xin et al.

SANE-TTS

CrossSpeech

ABLATION STUDY

Cross-lingual

English text + Chinese speaker

Script: What do you do in a general health checkup?

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


English text + Korean speaker

Script: I was panicking last night because I wasn't sure what to do.

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Chinese text + English speaker

Script: 纸巾纸一般具有湿韧强度。

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Chinese text + Korean speaker

Script: 山头叫包家岩。

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Korean text + English speaker

Script: 배꽃 떨어질 때

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Korean text + Chinese speaker

Script: 트윈 폴리오라는 그룹이 있었다.

Target speaker

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res



Intra-lingual

English text + English speaker

Script: Produced the block books, which were the immediate predecessors of the true printed book.

Ground truth

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Chinese text + Chinese speaker

Script: 下了车,杨花扑面,脸上痒痒的。

Ground truth

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res


Korean text + Korean speaker

Script: 넥타이, 벨트, 구두, 다 명품이더라 이러는데

Ground truth

CrossSpeech

w/o M-DSLN

w/o $\mathcal{L}_{sgr}$

w/o SIP

w/o SDP

w/o Res