Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim
While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap stems mainly from the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech, which improves the quality of cross-lingual speech by effectively disentangling speaker and language information at the level of the acoustic feature space. Specifically, CrossSpeech decomposes the speech generation pipeline into a speaker-independent generator (SIG) and a speaker-dependent generator (SDG). The SIG produces a speaker-independent acoustic representation that is not biased toward specific speaker distributions. The SDG, in turn, models the speaker-dependent speech variation that characterizes speaker attributes. By handling each type of information separately, CrossSpeech obtains disentangled speaker and linguistic representations. Our experiments verify that CrossSpeech achieves significant improvements in cross-lingual TTS, especially in terms of speaker similarity to the target speaker.
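As a rough illustration of the two-stage decomposition described above, the toy sketch below separates a generator that never receives the speaker identity from one that produces a speaker-conditioned variation. Every name here (`speaker_independent_generator`, `speaker_dependent_generator`, `synthesize`, `SPEAKER_OFFSETS`), the feature dimension, and the additive combination are hypothetical placeholders chosen for illustration; this is not the CrossSpeech architecture, only a sketch of the general idea under those assumptions.

```python
from typing import Dict, List

Frame = List[float]

FEATURE_DIM = 4  # hypothetical size of one acoustic frame

# Hypothetical per-speaker offsets standing in for learned speaker information.
SPEAKER_OFFSETS: Dict[str, Frame] = {
    "speaker_a": [0.1, 0.2, 0.3, 0.4],
    "speaker_b": [-0.4, -0.3, -0.2, -0.1],
}


def speaker_independent_generator(phonemes: List[str]) -> List[Frame]:
    """Map text to frames from the text alone; no speaker input is accepted."""
    frames = []
    for ph in phonemes:
        # Deterministic toy feature derived only from the phoneme symbol.
        code = (sum(ord(c) for c in ph) % 10) / 10.0
        frames.append([code] * FEATURE_DIM)
    return frames


def speaker_dependent_generator(si_frames: List[Frame], speaker: str) -> List[Frame]:
    """Produce one speaker-conditioned variation frame per input frame."""
    offset = SPEAKER_OFFSETS[speaker]
    return [list(offset) for _ in si_frames]


def synthesize(phonemes: List[str], speaker: str) -> List[Frame]:
    """Combine both representations into the final output."""
    si = speaker_independent_generator(phonemes)
    sd = speaker_dependent_generator(si, speaker)
    # Assumption: the two parts are merged by element-wise addition.
    return [[a + b for a, b in zip(f_si, f_sd)] for f_si, f_sd in zip(si, sd)]
```

In this sketch, calling `synthesize` with the same phonemes and two different speakers changes only the speaker-dependent part, while the output of `speaker_independent_generator` stays identical because it has no access to the speaker identity.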
| Script: In being comparatively modern. | | |
|---|---|---|
| Speaker-independent representation | Speaker-dependent representation | The final output |
| Script: Has never been surpassed. | | |
|---|---|---|
| Speaker-independent representation | Speaker-dependent representation | The final output |
English text + Chinese speaker
| Script: That depends on the type of medical examination, but basically you have to go to a hospital. | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
English text + Korean speaker
| Script: But you will never be diagnosed with a lifestyle related disease at a clinic. | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Chinese text + English speaker
| Script: 下午三点钟,“考生”们各就各位后,接听正式开始。 | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Chinese text + Korean speaker
| Script: 我就跑啊,往山东跑,往新疆跑。 | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Korean text + English speaker
| Script: 더불어 아일랜드 사회의 변화의 흐름이 이어졌다. | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Korean text + Chinese speaker
| Script: 그를 둘러싼 흥미진진한 베일이 곧 걷혀질 참이다. | | | | |
|---|---|---|---|---|
| Target speaker | | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
English text + English speaker
| Script: Fourteen sixty-nine, fourteen seventy. | | | | |
|---|---|---|---|---|
| Ground truth | Vocoded | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Chinese text + Chinese speaker
| Script: 几周后,贝蒂确诊患有乳腺癌。 | | | | |
|---|---|---|---|---|
| Ground truth | Vocoded | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
Korean text + Korean speaker
| Script: 씻어 건져낸다. | | | | |
|---|---|---|---|---|
| Ground truth | Vocoded | | | |
| FastPitch | Y. Zhang et al. | D. Xin et al. | SANE-TTS | CrossSpeech |
English text + Chinese speaker
| Script: What do you do in a general health checkup? | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
English text + Korean speaker
| Script: I was panicking last night because I wasn't sure what to do. | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Chinese text + English speaker
| Script: 纸巾纸一般具有湿韧强度。 | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Chinese text + Korean speaker
| Script: 山头叫包家岩。 | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Korean text + English speaker
| Script: 배꽃 떨어질 때 | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Korean text + Chinese speaker
| Script: 트윈 폴리오라는 그룹이 있었다. | | |
|---|---|---|
| Target speaker | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
English text + English speaker
| Script: Produced the block books, which were the immediate predecessors of the true printed book. | | |
|---|---|---|
| Ground truth | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Chinese text + Chinese speaker
| Script: 下了车,杨花扑面,脸上痒痒的。 | | |
|---|---|---|
| Ground truth | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |
Korean text + Korean speaker
| Script: 넥타이, 벨트, 구두, 다 명품이더라 이러는데 | | |
|---|---|---|
| Ground truth | | |
| CrossSpeech | w/o M-DSLN | w/o $\mathcal{L}_{sgr}$ |
| w/o SIP | w/o SDP | w/o Res |