Forward with Feet on the Ground - Speech Technology the Finnish Way

By Antti Arppe

Research in speech technology has been undertaken in Finland since the 1960s, and some of its results have gained international renown or impact, such as the portable Synte 2 speech synthesizer in the 1970s and the phonetic typewriter in the 1980s. Individual speech products have also been on the market since the early 1990s; however, their clientele has been limited mainly to special groups such as the visually impaired. At the turn of the millennium a clear change took place. Both the public and the private sectors have embarked on major research and development projects in speech technology, which are starting to bear fruit - there now exist several basic technological solutions for both speech recognition and synthesis of Finnish which are on a par with those for any other language. Coupled with this, the last year has seen a significant increase in the launch of new types of end-user speech applications, some of which are finally aimed at the general public.

Speech technology makes its first major public entry in Finland at cinemas

As a sort of insider in the Finnish language technology community, I was somewhat baffled this spring while getting to my seat at a cinema. Between all sorts of fast-paced, MTV-style advertisements I suddenly realized that I was witnessing a real-time, walk-through demonstration of an automated telephone number service using speech analysis and synthesis. Someone was finally launching in earnest a real speech technology product or service to the general public in Finland. This was not a demonstration of a working prototype to academics at a scientific conference, at a software fair or at a sales pitch to potential venture capital investors, with all the obligatory demo effects. This was for real. This someone was Fonecta, a former subsidiary of the Finnish national telecommunications company Sonera. After all too much hype in the last few years, Finnish speech technology researchers have developed a healthy reserve towards their own work and its potential commercialization. Perhaps it says something about this reserve that a colleague of mine, Senior Researcher Martti Vainio from the University of Helsinki, commented afterwards that he first thought the commercial was a joke, adding "to us who live in that world [i.e. speech technology], it [Fonecta's advertisement] is just inherently so comic, since we live on the other side of the mirror." Might this new service beat our researchers' prejudices in the minds of the general public?

In its service, Fonecta has integrated its comprehensive database, which covers basically all private telephone number listings in Finland - including both wired and wireless listings provided by all Finnish telecommunications operators - with speech recognition technology from the Israeli-American Phonetic Systems and speech synthesis from the Finnish-based Timehouse. "Phonetics' system was specifically geared to handling up to tens of millions of different names, which was a prerequisite in a service such as this. As a consequence, several other potential, serious providers of technology got ruled out", says Senior Product Development Manager Timo Mattero at Fonecta. Fonecta's coverage was nationwide from its very launch in November 2001, adding up to some 5.5 million listings. This made it one of the first, if not the first, national telecommunications services of its type in the world.

"As the service presently stands, the automated service will probably not have a major impact on the number service market in general as it is integrated with the traditional human service - if the automated service fails the call is automatically routed forward to a human operator. And Finland is a difficult market in which to introduce automated products", explains Product Manager Leo Rantanen. But he adds that the service is not a mere technical toy and does aim at a clear market niche: "The service - costing only a fraction of the human alternative - is mainly aimed at people who pay their phone bills from their own pocket, that's to say young people who are also probably more willing to use new technology, and more adept at it, despite its drawbacks. Nevertheless, the service appears to be used by people of all sorts of backgrounds and ages." The service has passed its launch phase, and Rantanen is glad to note that "We have already some heavy users, such as small businesses, which seem to have realized the low cost of the service".

After a long dry season - New products and services keep popping up

Fonecta's launch of an end-user service or product based on speech technology is no longer the only one of its kind. It had been rather quiet on the commercial front since Timehouse introduced their speech synthesis module for Finnish in 1991, building on cooperation with the Laboratory of Acoustics and Audio Signal Processing at Helsinki University of Technology (HUT) and the Finnish Federation of the Visually Impaired. Later on, specifically at the beginning of this decade, several other companies, both Finnish and international, followed suit by providing their own versions of these basic building blocks of speech technology for Finnish - Babel Infovox (1993) and Fonix introduced solutions for Finnish speech synthesis, and Philips, Lingsoft and IBM introduced solutions for Finnish speech recognition.

Indeed, the market for Finnish speech technology is slowly showing some signs of taking off. As Managing Director Jaakko Happonen from Lingsoft puts it, "We are seeing the first steps in the development of a market structure that already exists in larger markets such as the United States. There are finally several providers of basic speech technology for both recognition and synthesis of Finnish. Though the quality is not yet nearly good enough for ideal, untrained running speech recognition, it is sufficient for specific, limited tasks such as voice command or for the needs of special groups such as the visually or hearing-impaired, doctors, dentists and other specialists. Thus, we are seeing press releases of such applications containing speech technology in one form or another every other month."

End-user applications for the general public were still lacking until around the time of Fonecta's launch, but that is when things started to change. As early as the end of 2000, the Finnish distributor Konttorityö brought to the market a Finnish version of Philips FreeSpeech Viva, a speech recognition program aimed at all PC users, whether at home or at work. In 2001, Lingsoft released Lingsoft Parrot, which speaks out loud in Finnish, Swedish or English any text under the mouse pointer on a PC screen. Earlier this year, Tikka Communications, a Finnish regional telecommunications operator, launched Puhesähköposti, a service developed by PT ControlNet that converts e-mails to speech so that they can be listened to over the phone.

Speech technology also continues to be used to improve products for specialist groups. Entteri, a Finnish provider of patient data management software for dentists, is launching this autumn a new version of their AssisDent product, which will now include a Finnish speech-driven interface. And finally, TietoEnator, the leading Nordic IT company, recently announced that later this year it will begin beta testing, at a local healthcare center, a new version of Effica, their comprehensive primary and special health care management system. The new version will include a speech-driven interface from Philips covering all areas of the basic health care process.

The integration of speech recognition technology with this breadth of coverage is the first of its kind on the Nordic health care market, according to Director Hannu Puuronen at TietoEnator Healthcare Finland. He definitely sees speech technology as a competitive advantage, as he expects that "once speech recognition is included as a fully functioning component of our patient database system, as we expect our pilot phase to be over in 2003, I do indeed believe that all of our healthcare station and hospital customers will switch over to this new version." He envisions that this product concept can be internationalized, but he notes that "in order to be a credible provider of a similar solution [integrating speech recognition] in other countries, one should already be established as a provider of the underlying basic system, as TietoEnator is presently in the other Nordic countries [i.e. Sweden, Norway and Denmark]."

Building on ten years of experience in commercializing speech synthesis of Finnish, Managing Director Kristian Töyrä from Timehouse also stresses the importance of quality in getting a foothold on the general market: "The clientele of our speech synthesizer Mikropuhe has practically been limited to the visually impaired. We have tried throughout the years to initiate co-operation with several telecommunications operators, but none of these attempts proceeded any further before the recent Fonecta case, because the quality has not been good enough. An unmotivated user, who for instance does not need a speech synthesizer for browsing the web, simply does not want to listen to synthesized speech. We hope that through improving the quality of our product we can extend our markets from the present niche to the more general market."

Which application uses whose basic technology for Finnish?

Application | Application provider | Technology | Technology provider
FreeSpeech Viva | Konttorityö | Philips High Speed and High Accuracy (HSA II) | Philips
Effica | TietoEnator | Speech Magic | Philips
AssisDent | Entteri | LSSR | Lingsoft
020200 | Fonecta | Voice Search Engine VSE (recognition) + Mikropuhe (synthesis) | Phonetic Systems, Timehouse
aids for the visually impaired | Kuulolaitekeskus/Oriola | Infovox 330/Infovox Desktop | Infovox
mobile phones | Nokia | - | Nokia Research Center (recognition)
mobile phones | Benefon | - | VoiceSignal (recognition)

Which application developer has licensed Finnish recognition or synthesis from which technology provider?

Developer | Technology | Technology provider
Siebel/US | FAAST (synthesis) | Fonix/US
PipeBeach/Sweden | FAAST (synthesis) | Fonix
Nuance/US | FAAST (synthesis) | Fonix
- | ETI-Eloquence (synthesis) | Speechworks/US
- | Voximizer (recognition & synthesis = Voice command) | Voxi/Sweden (coming up)
- | DirectTalk (recognition) | IBM/US

Anxiety about being left behind in technological development and research?

Though the Finnish market apparently is not yet awash with speech technology products, they seem to be making an entry of sorts. This is quite a contrast to some rather common views in the late 1990s that Finland and Finnish were in danger of being left behind the major world languages. As recently as 1999, the Nordic Council of Ministers issued a statement in which it was feared that the Nordic languages, including Finnish, were being increasingly marginalized with respect to the availability of speech technology, as they were felt to be of lower priority for international IT companies. Experts for the Council thus recommended the systematic collection of speech data that could facilitate development of such technology. With a similar air of urgency, the Finnish National Technology Agency (TEKES) raised natural language applications, specifically in speech, as one of the key development areas in its multidisciplinary technology program User-Oriented Information Technology USIX, which ran from 1999 to 2002. The lack of speech recognition for Finnish (which was still the case in 1999) was seen as a "major bottleneck for language engineering applications in Finland" (source).

Looking back with several years of hindsight, Happonen from Lingsoft believes this public support from TEKES has played an important role and will continue to do so, noting from a corporate viewpoint that "Speech recognition and synthesis technology for Finnish would certainly have been undertaken without TEKES money, though companies are used to exploiting this funding when available, but a lot of basic research would have remained undone without TEKES support." Many of the interviewees also mentioned EU funding in this light. For instance, Assistant Research Manager Péter Boda from Nokia Research Center notes that "EU funding has allowed us to explore new ideas in speech and language technology with partners that are experts in their corresponding fields."

Speech technology applications inherently require that the necessary basic research is undertaken and, in order to stay on a par with the development in other languages, that this research is continued. What, then, is the status of Finnish speech technology research at the moment? Professor Unto K. Laine from the Laboratory of Acoustics and Audio Signal Processing at the Helsinki University of Technology starts by remarking that Finnish research in speech technology has long and respectable traditions going all the way back to the 1960s, and continues that "a definite early landmark was the portable Synte 2 text-to-speech synthesizer of Finnish that Professor Matti Karjalainen and I developed in the late 1970s, which was the first of its kind for any language at that time. Since Karjalainen received his tenure here at HUT in the early 1980s, our unit has been active in numerous areas of speech technology." Concerning the present situation, Laine notes that "a key issue that one has to acknowledge is that Finnish differs considerably phonetically from all Indo-European languages, e.g. English and German. First of all, Finnish is a quantity language, which means that the durations of sounds form a distinctive feature: takka /tak:a/ 'furnace' and takaa /taka:/ 'behind' differ in meaning. Another feature of Finnish is that nouns and verbs have a high number of different inflected forms. In speech recognition of unlimited vocabulary the size of the core lexicon nevertheless has to be limited, and thus our language models end up being very complex. Present speech recognition technology has been to a great extent developed for languages which do not have these features and this will be a real challenge for us". Presently Laine leads the joint USIX STT (Speech-to-Text) project aiming at the speech recognition of Finnish with unlimited vocabulary. Whereas Laine's team specializes in speech acoustics, analysis and coding as well as auditory modelling, another team at the Neural Networks Research Centre (HUT) focuses on language technology and new IT methods for automatic speech recognition (ASR).
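
For readers who think in code, Laine's point about quantity can be made concrete with a small sketch. It is only an illustration and not a part of any of the systems discussed here: it relies on the fact that Finnish spelling writes a long sound with a doubled letter, and the function names are invented for this example. It shows that takka and takaa consist of the same sounds in the same order and differ only in which sound is long.

```python
def length_marked_segments(word):
    """Split a Finnish word into (sound, length) pairs.

    Simplifying assumption for this sketch: a doubled letter marks one
    long sound, and any other letter marks one short sound.
    """
    segments = []
    for letter in word.lower():
        if segments and segments[-1][0] == letter:
            # Second copy of the same letter: the previous sound is long.
            segments[-1] = (letter, "long")
        else:
            segments.append((letter, "short"))
    return segments


def sounds_only(word):
    """Return the sequence of sounds with the length information dropped."""
    return [sound for sound, _length in length_marked_segments(word)]


if __name__ == "__main__":
    for word in ("takka", "takaa"):
        print(word, length_marked_segments(word))
    # A recognizer that ignored duration would see the two words as identical.
    print(sounds_only("takka") == sounds_only("takaa"))
```

A recognizer built for a language in which duration carries no meaning could safely discard the second element of each pair; for Finnish it cannot.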

In the view of Mikko Kurimo, acting professor and leader of the speech recognition team at the Neural Networks Research Centre (HUT), Finnish researchers are not lagging behind in the research and development of state-of-the-art algorithms in the field of speech recognition. In Finland the emphasis of the research paradigm is first and foremost on language-independent algorithms, which is somewhat problematic from the viewpoint of a smaller language such as Finnish. According to Kurimo, "the basic prerequisite for one's research being acknowledged internationally is that its results can be compared with those of other research. This becomes a problem if common testing material is not available - both the training and the testing material should be exactly the same and in the same language. If one wants to compare one's new improved algorithm with a similar one that has been published and tested earlier with English testing data - which very often is the case - one has to resort to the same English test material. This skews speech recognition research towards English even in Finland. Thus it is clear that the features of Finnish have been studied less than those of English, for instance, but this is also a resource issue in a small country such as Finland".

Nevertheless, speech recognition research in Finland has had an impact on the global scale. As Finnish pronunciation is considerably and systematically closer to orthography than is the case for English, for instance, it has been natural for Finnish researchers to pursue phonetic approaches. Thus, the phonetic typewriter developed by Academician Teuvo Kohonen in 1988 has probably become the best known Finnish innovation in speech recognition. This prototype was an application of the neural network concept for recognizing words phoneme by phoneme, which according to Kurimo turned out to be "a great tonic at least in some languages similar to Finnish, where the English-based approach focusing on full-word recognition had come to a standstill at that time", and he continues that "Even this year researchers still contact me for additional information concerning Kohonen's original article." As a general consequence of this earlier research, Finnish researchers in continuous speech recognition have ended up studying phonemic recognition models at an earlier stage than their colleagues concerned with English, where full-word recognition was the paradigm until not so long ago. An example of this phonetic line of research is the above-mentioned USIX STT project.

Significant speech technology research has also been undertaken on the corporate side, where advances have been achieved e.g. in robust and multilingual speech recognition at the Speech and Audio Systems Laboratory of Nokia Research Center. "In the mid-1990s, we were considered as just another implementer of research results in speech technology", remarks Olli Viikki, Senior Research Manager at the Laboratory, and continues that "but presently, in our own field, i.e. speech recognition in embedded systems, I believe that we are now one of the leading research units in the world - in the last two years we have been asked to give plenary presentations or attend workshops on multiple occasions." On the other hand, Viikki admits that their scope is somewhat narrow, focused on the recognition of restricted vocabularies of some tens or hundreds of words, but he adds that this is to a certain extent the consequence of the generally incomplete nature of speech technology: "We have come up with and tried out the weirdest of concepts, but with a simulation one can quickly see that not many are worth implementing." Nevertheless, mobile phones are in Viikki's view one of the application areas where speech technology can be of the greatest benefit to users, and he notes with reserved Finnish pride that "knowing Nokia's market share on the mobile phone market, I would dare say that the application for Voice Dialing that we have developed here at Nokia in Finland is probably one of the most widely spread speech recognition applications in the world."

Finnish research in speech synthesis has in the last few years to a great extent revolved around SuoPuhe, the joint project for Finnish speech technology involving several universities and private organizations and financed by the National Technology Agency TEKES within the USIX framework. A major achievement in this project has been the development of a speech synthesizer for Finnish based on the diphone principle combined with prosodic information. "This approach is presently state-of-the-art - in mastery of the established core technologies we are at the same level as is the case for other world languages, though we have not always managed to try out the very latest fads such as unit-selection - which is not necessarily a bad thing when resources run thin", says Senior Researcher Martti Vainio, who is responsible for the SuoPuhe project at the University of Helsinki. In fact, he notes that the new synthesizer includes some features that place it at the very forefront of the field, such as text expansion of non-standard strings, e.g. numbers, which in Finnish must be inflected in the same form as their head noun - an aspect that one does not have to consider greatly in most Indo-European languages. Next in line in the development of the synthesizer is replacing the present part-of-speech tagger with an existing full-scale parser in order to improve the prediction of emphasis and pausing. Vainio believes that the SuoPuhe project can partly credit its success in closing the gap in speech synthesis between Finnish and other languages to the existence and availability of Festival, an open source speech synthesis platform developed at the Centre for Speech Technology Research at the University of Edinburgh, as the research group did not need to get involved in signal processing. What Vainio sees as problematic is that the present number of researchers knowledgeable in speech synthesis technology is in his view not enough both to pursue basic research and to develop and commercialize the present results.

What are the future prospects of Finnish speech technology?

The consensus of the interviewed Finnish speech technology researchers is that in the core areas, whether speech recognition or synthesis, the level and standard of the present research prototypes and on-going research for Finnish are on a par with the current international state-of-the-art. All also agree that the Finnish researchers have nothing to be ashamed of compared to their foreign colleagues - and that in some respects Finns are actually world leaders. Nevertheless, Vainio of the University of Helsinki feels that the Finnish speech technology research community is too narrow - "individual researchers, even if of world class, can only concentrate on a few areas concurrently. Speech technology such as synthesis covers so many and varied aspects ranging between the core areas of continuous processing of acoustic signal and symbolic linguistic models that one would really need more researchers to effectively bridge these two poles, if one would want to accelerate the development and not be dependent on the research interests of a few talented individuals." On this point, however, Kurimo from Helsinki University of Technology warns from his experience as a post-doctoral fellow abroad that "increasing resources and researchers may in the worst case only lead to even more specialization and splintering of the field, where the whole is not necessarily more than the sum of its parts."

In this respect, Viikki from Nokia comments on the organization of speech technology research in Finland: "On the research front, it is presently a clear problem for speech technology that the field lacks a dedicated academic chair in Finland. Presently speech technology is to a great extent rather an application area of technologies such as neural processing or signal processing than an autonomous research area. In this respect the field would greatly benefit from a situation where speech and its processing would be the actual starting point."

Though the Finnish language industry has been providing speech technology products for over a decade, and finally seems to be gathering some steam in this line of business, the feeling in the air is one of reserved anticipation rather than of high hopes of fast money, growth and internationalization. As Leo Rantanen of Fonecta says, "For the first company this is just a long and rocky road. With half a euro per call you can just imagine when the investment will pay off. But at least one paves the way for others." For his part, Jaakko Happonen of Lingsoft puts the speech technology market - specifically from the viewpoint of a small technology company - in a larger perspective: "In the beginning there is always a lot of interest in principle, but when one starts talking business, the interest evaporates. The general principles of IT sales are hindering the breakthrough of speech technology in Finland. No one wants to pay for pilot projects, especially when the quality of technology is still not as good as is expected, but no one wants to be left behind when all the others start buying. The old Valley of Death principle revisited." As a follow-up, Happonen ponders: "What would be a star end-user product in speech technology? Something that people need in their everyday life, but which they don't shy away from. The definition is quite abstract. But who knew that text messages would become such a hit as they have in Finland?" Almost as if in answer to this, Kurimo notes that sometimes an innovative, well-designed interface using the simplest of language technology can utterly discredit a solution based on the latest technological "breakthroughs". In conclusion, one could sum up the present situation with Viikki's opinion that "speech technology in general is in the research and development phase, which means that there is still a lot to do from the research point of view. This understandably restricts presently what can be successfully commercialized."

Acknowledgements:

The following people have been interviewed for this article or have commented on it, and the author is most grateful for their assistance and information: Péter Boda, Nokia Research Center; Mickel Grönroos, CSC-Scientific Computing; Jaakko Happonen, Lingsoft; Matti Karjalainen, Helsinki University of Technology; Kimmo Koskenniemi, University of Helsinki; Seppo Koskenniemi, IBM Finland; Mikko Kurimo, Helsinki University of Technology; Unto K. Laine, Helsinki University of Technology; Erkki Lumivirta, IBM Finland; Katri Luostarinen, CSC-Scientific Computing; Timo Mattero, Fonecta; Manne Miettinen, CSC-Scientific Computing; Jyrki Mäki-Laurila, Konttorityö Oy; Hannu Puuronen, TietoEnator; Leo Rantanen, Fonecta; Kristian Töyrä, Timehouse; Martti Vainio, University of Helsinki; Olli Viikki, Nokia Research Center; and Nicholas Volk, University of Helsinki.