Inventors: - Redmond WA, US; Jinyu LI - Redmond WA, US; Liang LU - Redmond WA, US; Yifan GONG - Sammamish WA, US; Hu Hu - Atlanta GA, US
Assignee:
Microsoft Technology Licensing, LLC - Redmond WA
International Classification:
G10L 15/06 G06N 3/04 G06N 3/08
Abstract:
Techniques performed by a data processing system for training a Recurrent Neural Network Transducer (RNN-T) herein include encoder pretraining by training a neural network-based token classification model using first token-aligned training data representing a plurality of utterances, where each utterance is associated with a plurality of frames of audio data and tokens representing each utterance are aligned with frame boundaries of the plurality of audio frames; obtaining a first cross-entropy (CE) criterion from the token classification model, wherein the CE criterion represents a divergence between expected outputs and reference outputs of the model; pretraining an encoder of an RNN-T based on the first CE criterion; and training the RNN-T with second training data after pretraining the encoder of the RNN-T. These techniques also include whole-network pretraining of the RNN-T. An RNN-T pretrained using these techniques may be used to process audio data that includes spoken content to obtain a textual representation.
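As an illustration of the encoder pretraining described above, the sketch below (PyTorch, with assumed feature dimensions, hyperparameters, and a hypothetical data loader yielding frame-aligned token labels) trains an LSTM encoder as a frame-level token classifier under a cross-entropy loss, then discards the temporary classification head so the encoder weights can initialize an RNN-T. It is one minimal reading of the abstract, not the patented implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Uni-directional LSTM encoder; sizes are illustrative assumptions."""
    def __init__(self, feat_dim=80, hidden=320, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        return out                             # (batch, frames, hidden)

def pretrain_encoder(encoder, loader, vocab_size, hidden=320, epochs=1):
    """CE pretraining on token-aligned frames: every frame carries a token label
    because the tokens were aligned to frame boundaries beforehand."""
    head = nn.Linear(hidden, vocab_size)       # temporary classification head
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, frame_tokens in loader:     # frame_tokens: (batch, frames) int labels
            logits = head(encoder(feats))      # (batch, frames, vocab_size)
            loss = ce(logits.reshape(-1, vocab_size), frame_tokens.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder                             # head is discarded; encoder seeds the RNN-T
```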
Sequence-To-Sequence Speech Recognition With Latency Threshold
Inventors: - Redmond WA, US; Jinyu LI - Redmond WA, US; Liang LU - Redmond WA, US; Hirofumi INAGUMA - Kyoto, JP; Yifan GONG - Sammamish WA, US
Assignee:
Microsoft Technology Licensing, LLC - Redmond WA
International Classification:
G10L 15/26 G10L 15/16
Abstract:
A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
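To make the latency constraint concrete, a toy check might compare each output token's emission frame against the frame assigned by the external alignment model; the helper name, the frame-indexed alignments, and the 24-frame threshold below are illustrative assumptions rather than the claimed method.

```python
def within_latency_threshold(output_alignments, external_alignments, threshold_frames=24):
    """Return True if every output token is emitted at most `threshold_frames`
    after the frame where the external model aligns the same token.
    Hypothetical helper used only to illustrate the constraint."""
    return all(out_frame - ext_frame <= threshold_frames
               for out_frame, ext_frame in zip(output_alignments, external_alignments))

# Example: tokens emitted at frames 12, 30, 70 vs. external alignments 10, 25, 40
print(within_latency_threshold([12, 30, 70], [10, 25, 40]))  # False: 70 - 40 > 24
```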
Layer Trajectory Long Short-Term Memory With Future Context
Inventors: - Redmond WA, US; Vadim MAZALOV - Issaquah WA, US; Changliang LIU - Bothell WA, US; Liang LU - Redmond WA, US; Yifan GONG - Sammamish WA, US
International Classification:
G06N 3/08 G10L 15/22 G10L 15/16 G10L 15/183
Abstract:
According to some embodiments, a machine learning model may include an input layer to receive an input signal as a series of frames representing handwriting data, speech data, audio data, and/or textual data. A plurality of time layers may be provided, and each time layer may comprise a uni-directional recurrent neural network processing block. A depth processing block may scan hidden states of the recurrent neural network processing block of each time layer, and the depth processing block may be associated with a first frame and receive context frame information of a sequence of one or more future frames relative to the first frame. An output layer may output a final classification as a classified posterior vector of the input signal. For example, the depth processing block may receive the context frame information from an output of a time layer processing block or another depth processing block of a future frame.
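One way to read the layer-trajectory idea in code is sketched below (PyTorch): stacked uni-directional time LSTMs, a depth LSTM cell that scans their hidden states across layers at each frame, and a small lookahead so the depth step also sees a few future frames' time-layer outputs. The layer sizes, the four-frame lookahead, and the exact wiring are assumptions, not the patented design.

```python
import torch
import torch.nn as nn

class LayerTrajectoryLSTM(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_layers=3, future=4, classes=500):
        super().__init__()
        self.time_layers = nn.ModuleList(
            nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(num_layers))
        # Depth cell sees the current frame plus `future` lookahead frames per layer.
        self.depth_cell = nn.LSTMCell(hidden * (1 + future), hidden)
        self.out = nn.Linear(hidden, classes)
        self.future = future

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        batch, frames, _ = x.shape
        layer_outputs, inp = [], x
        for lstm in self.time_layers:            # uni-directional time recursion
            inp, _ = lstm(inp)
            layer_outputs.append(inp)            # each: (batch, frames, hidden)
        logits = []
        for t in range(frames):
            h = x.new_zeros(batch, self.depth_cell.hidden_size)
            c = torch.zeros_like(h)
            for layer_out in layer_outputs:      # depth scan over layers at frame t
                idx = [min(t + k, frames - 1) for k in range(self.future + 1)]
                ctx = layer_out[:, idx, :].reshape(batch, -1)  # current + future frames
                h, c = self.depth_cell(ctx, (h, c))
            logits.append(self.out(h))           # frame-level classification
        return torch.stack(logits, dim=1)        # (batch, frames, classes)
```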
Machine Learning Model With Depth Processing Units
Abstract:
Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short-Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN), or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
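The attention layer between the top depth processing block and the output layer could be sketched roughly as follows (PyTorch); the pooled-attention form, shapes, and names are illustrative assumptions rather than the disclosed classifier.

```python
import torch
import torch.nn as nn

class AttentionOverDepthSummaries(nn.Module):
    """Attend over per-frame summaries from the top depth processing block
    before the output layer; shapes and scoring are assumptions."""
    def __init__(self, hidden=256, classes=500):
        super().__init__()
        self.score = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, classes)

    def forward(self, depth_summaries):          # (batch, frames, hidden)
        weights = torch.softmax(self.score(depth_summaries), dim=1)  # (batch, frames, 1)
        pooled = (weights * depth_summaries).sum(dim=1)              # (batch, hidden)
        return self.out(pooled)                  # (batch, classes)
```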
Name / Title: Liang Lu
Company / Classification: Summer2days2005 Internet Shopping
Phones & Addresses: 188-350 Columbia Street West, Waterloo, ON N2L 6P4; 647-219-3952
Experience:
Microsoft
Senior Applied Scientist
Toyota Technological Institute at Chicago (TTIC), Sep 1, 2016 - Aug 2017
Research Assistant Professor
The University of Edinburgh, Sep 2012 - Jul 2016
Postdoctoral Research Associate
Orange, 2006 - 2009
Research Intern
Education:
University of Edinburgh, School of Philosophy, Psychology and Language Sciences, 2009 - 2013
Doctorate: Doctor of Philosophy, Computer Science
Beijing University of Posts and Telecommunications, 2003 - 2007
Bachelor of Science
Skills:
Machine Learning, C++, MATLAB, Pattern Recognition, LaTeX, Algorithms, Signal Processing, Artificial Intelligence, C, Natural Language Processing, Computer Science, Data Mining, Computer Vision, Research, Kaldi Speech Recognition Toolkit, HTK Speech Recognition Toolkit, R