This post summarizes how to automatically add English subtitles and an English audio track to Japanese videos, using Whisper on Azure OpenAI Service together with Azure Speech Services.

Overview

The goal this time is to publish a video recorded with Japanese audio in two language versions:

  • Japanese version: Original video (Japanese audio, no subtitles)
  • English version: English audio + English subtitles

Services Used

Service                          Purpose
Azure OpenAI Service (Whisper)   Translation from Japanese audio to English text
Azure Speech Services (TTS)      Synthesis from English text to English audio
FFmpeg                           Audio extraction and video merging

Procedure

1. Environment Setup

Required Tools

    # Install FFmpeg (macOS)
    brew install ffmpeg
    # Python libraries
    pip install python-dotenv requests

Azure Configuration (.env)

    AZURE_OPENAI_ENDPOINT=https://xxxxx.openai.azure.com
    AZURE_OPENAI_API_KEY=your-api-key
    AZURE_OPENAI_DEPLOYMENT_NAME=whisper
    AZURE_OPENAI_API_VERSION=2024-06-01

2. Extract Audio from Video

Because the Whisper API on Azure OpenAI Service has a 25 MB file size limit, the audio track is extracted as a compressed MP3 (64 kbps, 16 kHz).

    ffmpeg -i input.mp4 -vn -acodec libmp3lame -b:a 64k -ar 16000 audio.mp3
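
As a rough check on the 25 MB limit, the size of the extracted MP3 can be estimated from the bitrate before uploading. The sketch below is only an illustration: the helper names are hypothetical, the limit is assumed to mean 25 MiB, and the `-y` (overwrite) flag is an addition that is not part of the command above.

```python
# Assumption: the 25 MB limit is treated as 25 MiB here.
WHISPER_MAX_BYTES = 25 * 1024 * 1024


def build_extract_command(video_path, audio_path, bitrate="64k", sample_rate=16000):
    """Return the ffmpeg argument list for the audio extraction in step 2."""
    return [
        "ffmpeg", "-y",            # -y overwrites an existing output file (added here)
        "-i", video_path,
        "-vn",                     # drop the video stream
        "-acodec", "libmp3lame",   # encode the audio as MP3
        "-b:a", bitrate,
        "-ar", str(sample_rate),
        audio_path,
    ]


def estimated_size_bytes(duration_seconds, bitrate_kbps=64):
    """Approximate size of a constant-bitrate audio file."""
    return int(duration_seconds * bitrate_kbps * 1000 / 8)


def fits_whisper_limit(duration_seconds, bitrate_kbps=64):
    """True if the estimated file size stays within the upload limit."""
    return estimated_size_bytes(duration_seconds, bitrate_kbps) <= WHISPER_MAX_BYTES


if __name__ == "__main__":
    print(build_extract_command("input.mp4", "audio.mp3"))
    print(estimated_size_bytes(8 * 60))   # estimate for an 8-minute video
    print(fits_whisper_limit(60 * 60))    # a one-hour video at 64 kbps
```

The list returned by `build_extract_command` can be passed to `subprocess.run`. At 64 kbps an 8-minute video comes to a few megabytes, while audio of about an hour would probably need a lower bitrate or splitting.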

3. Generate English Subtitles with Whisper

Using the Whisper API on Azure OpenAI Service, the Japanese audio is transcribed and translated into English in a single request.

    import requests

    # AZURE_ENDPOINT and AZURE_API_KEY hold the endpoint and key from .env
    url = f"{AZURE_ENDPOINT}/openai/deployments/whisper/audio/translations?api-version=2024-06-01"
    headers = {
        "api-key": AZURE_API_KEY
    }

    with open("audio.mp3", "rb") as f:
        files = {"file": f}
        data = {"response_format": "srt"}  # Output in SRT format
        response = requests.post(url, headers=headers, files=files, data=data)

    # Save as SRT file
    with open("subtitles_en.srt", "w") as f:
        f.write(response.text)

Key point: Using the translations endpoint, Japanese audio is directly translated into English text.
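
The later steps need the start time, end time, and text of each subtitle, so the SRT output has to be parsed at some point. The following is a minimal sketch of such a parser using only the standard library. The function names and the dictionary keys (`start`, `end`, `text`) are assumptions for illustration, not necessarily what the original scripts use.

```python
import re

# SRT timestamps look like 00:01:02,345 (hours:minutes:seconds,milliseconds)
TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp):
    """Convert an SRT timestamp to seconds as a float."""
    hours, minutes, seconds, millis = TIMESTAMP.fullmatch(stamp.strip()).groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(srt_text):
    """Split SRT text into a list of {'start', 'end', 'text'} dictionaries."""
    segments = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        # Expect: index line, timing line, then one or more text lines
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        segments.append({
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(line.strip() for line in lines[2:]),
        })
    return segments


if __name__ == "__main__":
    sample = (
        "1\n00:00:00,000 --> 00:00:02,500\nHello everyone.\n\n"
        "2\n00:00:03,000 --> 00:00:05,000\nToday we look\nat Azure.\n"
    )
    for segment in parse_srt(sample):
        print(segment)
```

Multi-line subtitles are joined into a single string so that each segment can be sent to text-to-speech as one utterance.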

4. Generate English Audio with Speech Services

The generated subtitle text is synthesized into audio using Azure Speech Services.

    url = f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": API_KEY,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3"
    }
    ssml = f"""<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
        <voice name='en-US-JennyNeural'>{text}</voice>
    </speak>"""
    response = requests.post(url, headers=headers, data=ssml.encode('utf-8'))
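
One detail worth guarding against when building SSML from subtitle text: characters such as `&` or `<` in the text would make the SSML invalid XML. Below is a small sketch of an SSML builder that escapes the text first. The function name `build_ssml` is hypothetical, and the default voice is only an example value.

```python
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap subtitle text in a minimal SSML document, escaping XML special characters."""
    return (
        f"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='{lang}'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # The ampersand and angle brackets are escaped before being embedded
    print(build_ssml("Q&A <live>"))
```

The resulting string can be encoded as UTF-8 and sent as the request body in the same way as the SSML in this step.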

5. Create Timing-Synchronized Audio

To place audio according to subtitle timestamps, the following processing is performed:

  1. Generate individual audio for each subtitle segment
  2. Insert silence based on timestamps
  3. Concatenate all segments

    # Process each segment
    for subtitle in subtitles:
        # Insert silence for the gap from the previous segment
        if gap > 0:
            create_silence(gap, silence_path)
        # Generate audio with TTS
        text_to_speech(subtitle['text'], temp_path)
        # Adjust tempo if audio is too long
        if actual_duration > segment_duration:
            speed = actual_duration / segment_duration
            # Adjust speed with ffmpeg's atempo filter
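
The placement logic can be separated from the audio processing and checked on its own. The sketch below only plans the timeline: given the subtitle segments and the measured duration of each synthesized clip, it decides where silence goes and which clips would need speeding up. The function name, the segment keys (`start`, `end`), and the rule of speeding up a clip only when it is longer than its subtitle slot are assumptions for illustration.

```python
def plan_timeline(segments, clip_durations):
    """Return a list of steps: ('silence', seconds) or ('speech', index, tempo).

    segments: dictionaries with 'start' and 'end' in seconds
    clip_durations: measured length in seconds of the TTS clip for each segment
    """
    plan = []
    position = 0.0  # current end of the assembled audio track, in seconds
    for index, (segment, duration) in enumerate(zip(segments, clip_durations)):
        # Pad with silence until the subtitle's start time
        gap = segment["start"] - position
        if gap > 0:
            plan.append(("silence", round(gap, 3)))
            position += gap
        # Speed the clip up only if it does not fit into its subtitle slot
        slot = segment["end"] - segment["start"]
        tempo = duration / slot if slot > 0 and duration > slot else 1.0
        plan.append(("speech", index, round(tempo, 3)))
        position += duration / tempo
    return plan


if __name__ == "__main__":
    segments = [{"start": 1.0, "end": 3.0}, {"start": 4.0, "end": 5.0}]
    for step in plan_timeline(segments, [1.5, 2.0]):
        print(step)
```

In this example the first clip fits its slot and is left as-is, while the second is twice as long as its slot and gets a tempo of 2.0. If an earlier clip overruns the start of the next subtitle, the gap is negative and no silence is inserted, so the audio drifts slightly behind the subtitles.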

6. Create English Version Video

Finally, combine the original video’s visuals with the English audio.

    ffmpeg -i original.mp4 -i english_audio.mp3 \
      -c:v copy -map 0:v:0 -map 1:a:0 -shortest \
      output_en.mp4

Generated Files

File               Description
original.mp4       Japanese version (original video)
output_en.mp4      English version (with English audio)
subtitles_en.srt   English subtitle file

Uploading to YouTube

Japanese Version

  • Upload the original video as-is

English Version

  1. Upload the video with English audio
  2. Add the SRT file in YouTube Studio > Subtitles tab

Cost

Service                 Price (approximate)
Azure OpenAI Whisper    $0.006 / minute
Azure Speech Services   $16 / 1 million characters

For a video of about 8 minutes, the cost is roughly a few dozen yen.
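
The estimate can be reproduced with a few lines of arithmetic. In the sketch below, the unit prices come from the table above, while the character count (6,000 characters of English text for 8 minutes) and the exchange rate (150 yen per US dollar) are assumptions chosen only for illustration.

```python
WHISPER_USD_PER_MINUTE = 0.006      # from the price table above
TTS_USD_PER_MILLION_CHARS = 16.0    # from the price table above


def estimate_cost_usd(video_minutes, tts_characters):
    """Approximate cost of one translation run plus one synthesis run."""
    whisper_cost = video_minutes * WHISPER_USD_PER_MINUTE
    tts_cost = tts_characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return whisper_cost + tts_cost


def usd_to_jpy(usd, jpy_per_usd=150.0):
    """Convert using an assumed exchange rate."""
    return usd * jpy_per_usd


if __name__ == "__main__":
    usd = estimate_cost_usd(video_minutes=8, tts_characters=6000)  # assumed character count
    print(f"about {usd:.3f} USD, about {usd_to_jpy(usd):.0f} JPY")
```

Under these assumptions Whisper accounts for about $0.05 and speech synthesis for about $0.10, which is on the order of 20 yen in total.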

Summary

By combining Azure OpenAI Service’s Whisper and Speech Services, you can automate the process of creating English versions of Japanese videos.

Benefits:

  • High-accuracy translation (Whisper’s translation feature)
  • Natural English audio (Neural TTS)
  • Automatic generation of timestamped subtitles

Notes:

  • Audio length and subtitle timing need adjustment
  • Technical terms and proper nouns may need manual correction

Repository Structure

    .env                        # Azure credentials
    .gitignore                  # Exclude data/
    generate_subtitles.py       # Subtitle generation script
    generate_audio.py           # Audio generation script
    generate_synced_audio.py    # Synced audio generation script
    viewer.html                 # Subtitle preview viewer
    data/                       # Media files (not tracked by git)
        original.mp4
        output_en.mp4
        subtitles_en.srt
        audio_en.mp3