Fig. 8 | Computational Cognitive Science

From: Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis

Schematic representation of the visual synthesis system. Given a series of diphones (e.g., for the word “Welcome”), the system searches the diphone dictionary for candidates and builds a trellis of diphones (i.e., a list of candidates for each diphone). The best series of diphones is the one that minimizes the concatenation cost, i.e., the RMS distance between the articulatory-parameter values of the last frame of the previous diphone and those of the first frame of the current one. The selected diphones are linked by a blue line in this example. When consecutive diphones come from the same sentence, there is no gap at the concatenation border (e.g., the diphones El, lk and k@ in this example). Otherwise, there is a gap (Δ) at the concatenation that must be minimized. A gapless processing step is therefore applied to each articulatory parameter: at each time step, a small offset equal to Δ/T × frame_index is added, where Δ is the difference between the value of the articulatory parameter at the last frame of the current diphone and its value at the first frame of the following one, T is the total number of frames of the current diphone, and frame_index is the index of the current frame. This reduces the gap while preserving the nonlinear temporal variation of the parameter values.
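The two steps in the caption, minimum-cost selection over the trellis and the Δ/T × frame_index gap smoothing, can be made concrete with a short sketch. This is illustrative rather than the authors' implementation: the (frames × parameters) array layout, the function names select_path and smooth_gaps, the Viterbi-style dynamic program used to minimize the summed concatenation cost, and the sign convention for Δ are all assumptions.

```python
import numpy as np

def concat_cost(prev: np.ndarray, cur: np.ndarray) -> float:
    """RMS distance between the articulatory parameters of the last
    frame of the previous diphone and the first frame of the current one."""
    return float(np.sqrt(np.mean((prev[-1] - cur[0]) ** 2)))

def select_path(trellis: list[list[np.ndarray]]) -> list[np.ndarray]:
    """Pick one candidate per diphone so that the summed concatenation
    cost along the trellis is minimal (Viterbi-style dynamic program;
    one standard way to minimize such a cost, assumed here)."""
    n = len(trellis)
    best = [[0.0] * len(col) for col in trellis]   # minimal cost to reach each candidate
    back = [[0] * len(col) for col in trellis]     # backpointers
    for i in range(1, n):
        for j, cand in enumerate(trellis[i]):
            costs = [best[i - 1][k] + concat_cost(prev, cand)
                     for k, prev in enumerate(trellis[i - 1])]
            back[i][j] = int(np.argmin(costs))
            best[i][j] = min(costs)
    # Backtrack from the cheapest final candidate.
    j = int(np.argmin(best[-1]))
    path = [trellis[-1][j]]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(trellis[i - 1][j])
    return path[::-1]

def smooth_gaps(path: list[np.ndarray]) -> list[np.ndarray]:
    """Gapless processing: inside each diphone, add the offset
    Δ/T × frame_index to every articulatory parameter. Sign convention
    is an assumption: Δ is taken as (first frame of next diphone) minus
    (last frame of current one), so the ramp closes the gap; Δ is zero
    when consecutive diphones come from the same sentence."""
    out = [seg.astype(float).copy() for seg in path]
    for i in range(len(out) - 1):
        delta = path[i + 1][0] - path[i][-1]       # Δ per parameter
        T = out[i].shape[0]                        # total frames of current diphone
        ramp = np.arange(1, T + 1)[:, None] / T    # frame_index / T, reaching 1 at the last frame
        out[i] += ramp * delta[None, :]            # shifts the curve, keeps its nonlinear shape
    return out

# Hypothetical usage: three diphones with two candidates each,
# 12 frames and 6 articulatory parameters per candidate.
trellis = [[np.random.rand(12, 6) for _ in range(2)] for _ in range(3)]
segments = smooth_gaps(select_path(trellis))
movie = np.vstack(segments)                        # frame sequence ready for rendering
```

Because the linear ramp only adds a constant-slope offset within each diphone, the frame-to-frame dynamics of the stored trajectory are preserved, which is the property the caption emphasizes.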
