Front Neurosci. 2019 Nov 22;13:1267. doi: 10.3389/fnins.2019.01267. eCollection 2019.

Generating Natural, Intelligible Speech From Brain Activity in Motor, Premotor, and Inferior Frontal Cortices


Christian Herff et al. Front Neurosci. 2019.

Abstract

Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.

Keywords: BCI; ECoG; brain-computer interface; brain-to-speech; speech; synthesis.


Figures

Figure 1
Experimental Setup: ECoG and audible speech (light blue) were measured simultaneously while participants read words shown on a computer screen. We recorded ECoG data on inferior frontal (green), premotor (blue), and motor (purple) cortices.
Figure 2
Electrode grid positions for all six participants. Grids always covered areas in inferior frontal gyrus pars opercularis (IFG, green), ventral premotor cortex (PMv, blue), and ventral motor cortex (M1v, purple).
Figure 3
Speech Generation Approach: For each window of high gamma activity in the test data (top left), the cosine similarity to each window in the training data (center bottom) was computed. The window in the training data that maximized the cosine similarity was determined and the corresponding speech unit (center top) was selected. The resulting overlapping speech units (top right) were combined using Hanning windows to form the generated speech output (bottom right). Also see Supplementary Video 1.
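The selection step in Figure 3 can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the described procedure, not the authors' code: array names, shapes, and the 50% overlap are assumptions. Each test window of high gamma features is matched to the training window with maximal cosine similarity, the corresponding speech unit is taken, and the overlapping units are combined with Hanning windows.

```python
import numpy as np

def select_units(test_feats, train_feats, train_units):
    """For each test feature window, return the training speech unit whose
    ECoG features are most cosine-similar (names/shapes are assumptions)."""
    # Normalize rows so plain dot products equal cosine similarities.
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    best = np.argmax(te @ tr.T, axis=1)   # most similar training window per test window
    return train_units[best]              # the corresponding audio units

def overlap_add(units, hop):
    """Combine overlapping speech units with a Hanning window (overlap-add)."""
    n = units.shape[1]
    out = np.zeros(hop * (len(units) - 1) + n)
    win = np.hanning(n)
    for i, u in enumerate(units):
        out[i * hop:i * hop + n] += win * u
    return out
```

With a hop of half the unit length, consecutive Hanning-windowed units cross-fade into a continuous waveform, which is what produces the smooth generated output shown in the figure.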
Figure 4
Generation example: actual (top) and generated (bottom) audio waveforms (A) and spectrograms (B) of seven words spoken by participant 5. Similarities between the generated and actual speech are striking, especially in the spectral domain (B). These generated examples can be found in Supplementary Audio 1.
Figure 5
Performance of our generation approach. (A) Correlation coefficients between the spectrograms of original and generated audio waveforms for the best (purple) and average (green) participant. While all regions yielded better-than-random results on average, M1v provided the most information for our reconstruction process. (B) Results of a listening test with 55 human listeners. Accuracies in the 4-option forced-choice intelligibility test were above chance level (25%, dashed line) for all listeners.
Figure 6
Detailed decoding results. (A) Correlations between original and reconstructed spectrograms (mel-scaled) for all participants and electrode locations. Stars indicate significance levels (* larger than 95% of random activations, *** larger than 99.9% of random activations). M1v contains the most information for our decoding approach. (B) Detailed results for the best participant using all electrodes and the entire temporal context (blue), and using only activity prior to the current moment (cyan), across all frequency coefficients. Shaded areas denote 95% confidence intervals. Reconstruction is reliable and above chance level (maximum of all randomizations, red) across all frequency ranges.
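The per-coefficient correlations reported in Figures 5 and 6 can be computed as a Pearson correlation between original and reconstructed spectrograms over time, one value per (mel) frequency coefficient. A minimal sketch, assuming time-by-frequency arrays (the array layout and names are assumptions, not the authors' implementation):

```python
import numpy as np

def spectrogram_correlations(orig, recon):
    """Pearson correlation between original and reconstructed spectrograms,
    computed per frequency coefficient across time frames.

    orig, recon: arrays of shape (time_frames, n_coefficients)."""
    o = orig - orig.mean(axis=0)          # center each coefficient over time
    r = recon - recon.mean(axis=0)
    num = (o * r).sum(axis=0)             # covariance numerator per coefficient
    den = np.sqrt((o ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return num / den                      # one correlation per coefficient
```

Averaging the returned vector gives a single quality score per participant; comparing it against the same statistic computed on randomized activations yields the significance thresholds marked by stars in Figure 6A.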

