Advances in application programming interfaces (APIs) for speech recognition are affecting our everyday lives, from smartphones to the Internet of Things (IoT). According to Katy Levinson on Quora.com, “…an API forces structured data-based … exchanges between what the designer made and the outside world which wants to use it.” For example, the software application on a smart device is the user/client program, and the API orchestrates the user’s inputs and the app’s outputs via a list of commands and programmed formatting. Oi (2017) predicted that the IoT will make user experiences with innovative designs more natural than using a smartphone for tasks. Oi explained that the API economy is feeding user design innovation and vice versa, especially as some companies provide their APIs as open source.
Speech recognition has zoomed past its humble beginnings. In the 1950s, voice recognition began with syllables, vowels, and numbers (Juang & Rabiner, 2004). For example, in 1952, Bell Laboratories created the Audrey system to recognize spoken numbers from designated individuals (Pinola, 2011). In the 1960s, IBM created the Shoebox machine, which could recognize a limited number of spoken words, numbers, and basic math commands from any user. Speech recognition flourished in the 1970s, when the U.S. Department of Defense funded programs such as Carnegie Mellon’s HARPY machine, which could understand 1,000 words (Pinola, 2011). Pinola described how speech recognition turned to predictive statistical modeling (i.e., Hidden Markov Models) in the 1980s, which spurred systems to grow exponentially in vocabulary recognition.
This interview with Dr. Patricia Scanlon, Founder and CEO of SoapBox Labs, focuses on the company’s new cloud-based API, which uses voice recognition to improve children’s aural/oral interactions with their smart toys and gadgets. SoapBox Labs’ proprietary API models provide the following affordances for smart devices designed specifically for children: voice control, conversational engagement for entertainment and education, and speech assessment for language and literacy.
Dr. Scanlon has a doctorate in digital signal processing, within electrical engineering, from University College Dublin. She is an artificial intelligence advisor for TechIreland and a mentor for the Founder Institute in County Dublin. She worked in research and development at Bell Laboratories for seven years before founding SoapBox Labs. She is now in her fifth year at SoapBox Labs, which focuses on API development for voice-enabled devices (e.g., videogames, smartphone apps, smart home devices, and toys) in the children’s market.
Tell me how your speech recognition technology is unique in the current API market.
SoapBox Labs have developed a proprietary speech technology for use exclusively with children. Our mission is to create the world’s most accurate and accessible speech technology for kids under 12, with a particular emphasis on developing a voice interface that is engaging, age-appropriate and safe. Our technology is licensed to third parties to voice-enable their products in a wide variety of sectors and application areas (home devices, IoT, robotics, games, AR/VR, and education, including reading and language learning).
How can developers use your API?
Developers use our technology by simply sending an audio file to our API, and our system responds in near real-time.
For educational reading or language learning assessment, our system can report how well a word or phrase was pronounced and how fluently the child read, and it can evaluate answers to comprehension questions.
For voice control of home/IoT devices, toys, games or robots, our system responds with the voice command the child said, e.g. on/off, forward, backward, etc.
For conversation and engagement, our system recognises key words in the child’s response to a prompt in order to engage the child and continue the interaction, e.g. What age are you? What is your favorite color? How many legs does a spider have? What is your favorite farm animal?
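Editor's illustration: the interview does not describe the actual request or response format of the SoapBox Labs API, so the following Python sketch is purely hypothetical. The field names (use_case, prompt_text, audio_base64, score, target, command, keywords) and both helper functions are assumptions made up for illustration. It only shows the general pattern described above: an app packages an audio clip, sends it, and branches on the kind of answer that comes back.

```python
import base64
import json

# Hypothetical use-case labels mirroring the three scenarios in the interview.
USE_CASES = {"assessment", "voice_control", "conversation"}


def build_request(audio: bytes, use_case: str, prompt_text: str = "") -> str:
    """Package an audio clip and its context as a JSON request body.

    The JSON layout is an assumption, not the vendor's documented format.
    """
    if use_case not in USE_CASES:
        raise ValueError(f"unknown use case: {use_case}")
    return json.dumps({
        "use_case": use_case,
        "prompt_text": prompt_text,
        # Binary audio is base64-encoded so it can travel inside JSON.
        "audio_base64": base64.b64encode(audio).decode("ascii"),
    })


def describe_response(response: dict) -> str:
    """Turn a (hypothetical) service response into a message for the app."""
    kind = response["use_case"]
    if kind == "assessment":
        return f"pronunciation score {response['score']:.2f} for '{response['target']}'"
    if kind == "voice_control":
        return f"run command: {response['command']}"
    if kind == "conversation":
        return "keywords heard: " + ", ".join(response["keywords"])
    raise ValueError(f"unknown use case: {kind}")


if __name__ == "__main__":
    body = build_request(b"\x00\x01", "voice_control")
    print(body)
    print(describe_response({"use_case": "voice_control", "command": "forward"}))
```

In a real integration the request body would be sent over HTTPS to whatever endpoint the vendor documents, and the response fields would follow the vendor's schema rather than the ones assumed here.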
What are the educational benefits of children interacting with smart devices that use a robust voice recognition API?
Voice technology can enable educational applications to advance reading or language learning skills by acting as a tutor: listening as children read aloud, and assessing, correcting and prompting where necessary.
Research shows that best practice for developing students’ word recognition, fluency, and comprehension is one-on-one guided repeated oral reading, when a child reads aloud alongside a helpful adult.
However, such individualised daily support is not possible due to lack of resources: for example, just 10 minutes a day with each of 25 children in a classroom would require fully two-thirds of the school day.
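Editor's illustration: the arithmetic behind this claim can be checked in a few lines of Python. The interview does not state the length of the school day, so the sketch derives the day length that the "two-thirds" figure implies instead of assuming one; the variable names are illustrative.

```python
from fractions import Fraction

children = 25
minutes_per_child = 10

# Total one-on-one reading time needed per day.
total_minutes = children * minutes_per_child

# The interview says this is two-thirds of the school day,
# so the implied length of the school day is total / (2/3).
share_of_day = Fraction(2, 3)
implied_day_minutes = total_minutes / share_of_day
implied_day_hours = float(implied_day_minutes / 60)

print(f"{total_minutes} minutes of one-on-one reading per day")
print(f"implied school day: {implied_day_minutes} minutes ({implied_day_hours} hours)")
```

In other words, 250 minutes of individual reading support is two-thirds of a school day of about six and a quarter hours.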
Hiring additional reading tutors is not economically viable for schools due to overstretched public budgets. Few children have the luxury of a private tutor. Due to lack of resources, strategies such as reading in groups are often employed but their impact on reading acquisition pales in comparison to individual guided oral reading.
A 2011 report from the Joan Ganz Cooney Center proposed the use of speech recognition technology to enable assessment for automated reading tutors for developing children’s literacy. In such applications, speech recognition technology can give the computer ‘ears’ to track the reading position of the child, detect oral reading errors (e.g. substitution, omission, hesitation, etc.), prompt the child where necessary, and assess reading fluency and comprehension of the text being read. In this way, the technology provides one-to-one guided oral reading instruction, much as a helpful adult would.
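Editor's illustration: one simple way to flag substitutions and omissions is to align the words of the target text with the words a recognizer heard. The Python sketch below does this with the standard library's difflib.SequenceMatcher. It is an assumption-laden toy, not the method used by SoapBox Labs or proposed in the 2011 report: it works on already-transcribed text, splits words naively on whitespace, and cannot detect hesitations, which would need timing information from the audio. The function name reading_errors is hypothetical.

```python
from difflib import SequenceMatcher


def reading_errors(target_text: str, heard_text: str) -> list:
    """Compare the target sentence with what was heard, word by word.

    Returns a list of (error_type, expected_words, heard_words) tuples.
    """
    target = target_text.lower().split()
    heard = heard_text.lower().split()
    errors = []
    # autojunk=False keeps the matcher from ignoring frequent words.
    matcher = SequenceMatcher(a=target, b=heard, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Child said something different from the expected word(s).
            errors.append(("substitution", target[i1:i2], heard[j1:j2]))
        elif tag == "delete":
            # Expected word(s) were skipped.
            errors.append(("omission", target[i1:i2], []))
        elif tag == "insert":
            # Extra word(s) that are not in the text.
            errors.append(("insertion", [], heard[j1:j2]))
    return errors


if __name__ == "__main__":
    print(reading_errors("the spider has eight legs", "the spider has legs"))
    print(reading_errors("the cat sat", "the cap sat"))
```

A tutoring application could use such a list to decide where to prompt the child, for example by rereading the omitted word aloud.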
While speech technology cannot replace reading instruction by teachers/parents, it can provide a cost-effective and scalable teaching aid that can help improve a child’s literacy and language learning skills through more regular practice.
Can you provide one or two seminal research papers on the topic that have influenced your work?
The 2011 Joan Ganz Cooney Center paper has certainly been a great influence.
How does your API address the security and privacy of the user’s voice and behaviors when incorporated in a smart device? Do you have a security policy?
All of our speech data is collected, processed and stored in full compliance with strict data privacy regulations globally. When we work with our clients, we take measures to ensure that they also comply with these regulations. This ensures that we are legally compliant from a security and privacy perspective.
However, we go a step further in terms of ‘online safety’, by enabling our customers to design experiences that are also age-appropriate. By using our API, our customers are integrating with speech technology that has been designed exclusively for use by children – and our privacy by design approach means that we ensure that the data is used only to improve the speech technology product itself.
Can parents control the privacy settings for the device or application? Or are the security and privacy of the user the responsibility of the smart device manufacturer and application software developers?
As we said above, we request that our clients (third-party developers, device manufacturers, etc.) conform to the explicit consent requirements of the legislation where they are operating before a child can use the product. Normally this means informed consent from the parents/guardians of children under 13, for example under US COPPA and EU GDPR legislation.
This privacy by design approach enshrines the rights of data subjects into our platform – for example, by supporting the right of erasure as a matter of principle, allowing parents/guardians to request deletion of all email/data at any time. Crucially, though, this also means that the device manufacturers and software developers are able to deploy our service such that they can provide a ‘gated’ experience for children – and one that is age appropriate, safe and differentiated. In that sense, we are enabling an approach which delivers a much more engaging and appropriate voice interface for kids – whilst at the same time promoting their security and privacy.
What are the ethical considerations you face as an API developer?
We ensure compliance with global data privacy rules. We also advise our clients on how to comply with these regulations and monitor compliance on a regular basis. Our privacy-by-design approach means that the speech data collected is used solely to improve the speech recognition system, is never used for any other purpose, and is never shared outside of the company.
What are the limitations of current technology that place barriers on your company’s API development goals?
While we recognise speech from kids as young as 3 or 4, and this speech does not need to be perfect, one limitation is that, for the recognition system to work, the child’s speech must be understandable to an adult other than their parents.
In your opinion, what does the future hold for voice recognition and machine learning?
Voice is the most natural form of communication. Voice technology is set to replace typing, mouse clicks, touch and gesture as the dominant way to interface with technology. We can already see it used in our homes on smart assistants such as Amazon Alexa, Google Home, Apple HomePod, etc. to play music and set timers, and it is beginning to be integrated into lots of home devices, games, VR/AR experiences, etc.
Voice interactions have many positive benefits, including allowing children to interact naturally with technology without the need for screens. Young preliterate children benefit from using their voice to interact with technology, as opposed to typing or clicking. They also reap the benefits of its use in education, such as automated reading and language learning tutors that improve literacy and language learning skills.
Oi, R. (2017). Experience design innovation. NTT data technology foresight 2017—Examining future technology trends and how they will affect us. NTT Technical Review. Retrieved from https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr201710fa13_s.html
Juang, B. H., & Rabiner, L. R. (2004). Unpublished manuscript retrieved from http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-final-10-8.pdf
Pinola, M. (2011, November 2nd). Speech recognition through the decades: How we ended up with Siri. PCWorld. Retrieved from https://www.pcworld.com/article/243060/speech_recognition_through_the_decades_how_we_ended_up_with_siri.html
National Reading Panel. (2000). Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups.