<div dir="ltr">Hi guys,<div><br></div><div>I am looking for some advice on how to use a speech-to-text model with a Free Pascal (fpc) program designed to teach the reading of invented words composed from 8 Brazilian Portuguese phonemes (four consonants and four vowels).</div><div><br></div><div>Right now (<a href="https://github.com/cpicanco/stimulus-control-sdl2/blob/hanna/src/sdl.app.audio.recorder.devices.pas">https://github.com/cpicanco/stimulus-control-sdl2/blob/hanna/src/sdl.app.audio.recorder.devices.pas</a>) the program uses SDL2 to record short audio streams (4-5 s) and saves each recording to a WAV file using fpwavwriter. Each audio stream/file is supposed to be a word spoken by a student during a recording/playback session for a word presented on screen. The participant will click a button to finish the session. Then the program will start a speech-to-text routine and give some feedback.</div><div><br></div><div>There will be two speech-to-text routines. The first one will be a human transcription (nothing new here for me). The second one will be an AI transcription.</div><div><br></div><div>I am looking for a way to read the raw stream (or the saved file, if streams are not supported directly), pass it to a speech AI model (for example, Whisper), and get some text output back for further processing.</div><div><br></div><div>Using Python and Whisper Medium (multilingual), I got some good (although slow) results without any fine-tuning. However, I am considering using Transformers if any fine-tuning turns out to be necessary. </div><div><br></div><div>So, in this context, what would be "the way to go" for using the final model with Free Pascal? Calling a script with TProcess? Could you please shed some light on this?</div><div><br></div><div>Best regards,</div><div>R</div></div>
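<div>P.S. To make the "call a script" option concrete, here is a minimal sketch of what such a script could look like. It assumes the openai-whisper Python package is installed. The file name transcribe.py and the function names are only illustrative, and the model loader is a parameter so the script can be tried out without Whisper.</div>

```python
import sys


def load_whisper_model(model_name):
    """Load an openai-whisper model (assumes `pip install openai-whisper`)."""
    import whisper  # imported lazily, only when a real model is needed
    return whisper.load_model(model_name)


def transcribe_wav(wav_path, model_name="medium", language="pt",
                   load_model=load_whisper_model):
    """Return the transcription of one WAV file as a stripped string."""
    model = load_model(model_name)
    result = model.transcribe(wav_path, language=language)
    return result["text"].strip()


def main(argv):
    """Transcribe argv[0] and print the text on stdout; return an exit code."""
    if len(argv) != 1:
        print("usage: transcribe.py <file.wav>", file=sys.stderr)
        return 2
    print(transcribe_wav(argv[0]))
    return 0


# Run only when a WAV path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

<div>On the Free Pascal side, the idea would be to launch this with TProcess (or RunCommand from the process unit), pass the path of the saved WAV file as the only argument, and read the transcription from standard output. Note that this sketch reloads the model on every call, which is probably part of why a one-shot script feels slow.</div>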