#data-science
2022-01-23
sova-soars-the-sora 22:01:01

Hi, I have 3 inputs:
1. Audio clip of some spoken language
2. Language that the audio clip is in (e.g. Japanese, Swahili, Swedish, et cetera)
3. Transcript of the clip (script, text of the audio)

What I want as output: the ability to highlight word-by-word as playback happens.

Thinking:
+ Need a timing file (.srt) that has startTime->endTime for each "word"
+ Need to write a program that can read the startTime and endTime array from the file and toggle a DOM class (cljs) while playback is occurring to highlight each word.

Sample context: a "Hi my name is V" audio file with the subtitle text "Hi my name is V" written underneath.

Desired result: as the audio plays "Hi", the portion of the subtitle "Hi" is highlighted; as "my" is read, the word "my" is highlighted; as "name" is read, the word "name" is highlighted; as the next word is read, "is" is highlighted; and the same for "V". For each word, while it's being played, that word is highlighted in the output.

So I'm wondering, how can I derive (or generate, or create) these output files with startTime, endTime, and the word being spoken?

I am thinking of the overarching concept as:
1. Partially train a neural net to know how different words are pronounced, from some dataset like an online database of people reading single words (the name of an example service escapes me)
2. Then use this neural net as a basis to categorize full phrases word-by-word, where the "words" were learned in step 1. This gives us a final "word-by-word" sequence that must match the script and knows exactly when in the audio each word starts being spoken and when it stops.
3. i guess we're done then huh 😄
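To make the playback half concrete, here is a rough sketch of reading a word-level timing file and looking up which word should be highlighted at a given playback time. It is in Python rather than cljs just to show the logic, and it assumes the .srt has exactly one word per cue. The names `WordCue`, `parse_word_srt`, and `active_word_index` are made up for illustration, and the timings in `SAMPLE` are invented, not measured from any real audio.

```python
import bisect
import re
from typing import List, NamedTuple, Optional


class WordCue(NamedTuple):
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip
    word: str


# SubRip timestamps look like HH:MM:SS,mmm
_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def _seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    m = _TIME.fullmatch(stamp.strip())
    if m is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    h, mi, s, frac = m.groups()
    return int(h) * 3600 + int(mi) * 60 + int(s) + float("0." + frac)


def parse_word_srt(text: str) -> List[WordCue]:
    """Parse an .srt string, assuming each cue holds a single word."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        idx = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if idx is None:
            continue  # skip blocks without a timing line
        start, end = lines[idx].split("-->")
        cues.append(WordCue(_seconds(start), _seconds(end), " ".join(lines[idx + 1:])))
    cues.sort(key=lambda c: c.start)
    return cues


def active_word_index(cues: List[WordCue], t: float) -> Optional[int]:
    """Index of the word being spoken at time t, or None during silence."""
    starts = [c.start for c in cues]
    i = bisect.bisect_right(starts, t) - 1  # last cue starting at or before t
    if i >= 0 and t < cues[i].end:
        return i
    return None


# Invented timings for the "Hi my name is V" example.
SAMPLE = """\
1
00:00:00,000 --> 00:00:00,300
Hi

2
00:00:00,300 --> 00:00:00,550
my

3
00:00:00,550 --> 00:00:00,900
name

4
00:00:00,900 --> 00:00:01,100
is

5
00:00:01,200 --> 00:00:01,500
V
"""

cues = parse_word_srt(SAMPLE)
```

In the browser the same lookup would run on every playback-time update, with the returned index deciding which word element gets the highlight class and `None` meaning no word is highlighted.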

sova-soars-the-sora 22:01:13

I need help getting oriented. I do not know where to start crafting such a tool. I have read a couple of ML books but have not really had a reason to train a network until now-ish. So any guidance & direction you can give (or a full implementation if you feel like it! =P) would be appreciated.

sova-soars-the-sora 22:01:21

I think this thing ^ ought to exist for every language where there is enough data to be found online. And it could be open-sourced for educators, content creators, and karaoke machine makers everywhere to benefit from.