The Marilyn Manson Wiki:Automatic speech recognition

From MansonWiki, the Marilyn Manson encyclopedia

The sheer volume of material about Marilyn Manson and everything related to him makes it impossible to archive all of it. A significant portion of this material exists only as audio or video, which makes preserving it as text even harder. To partially address this, the use of automatic speech recognition software is permitted in some cases. This is particularly useful for transcribing and archiving interviews that have never been published online in text form.

MansonWiki currently uses two methods to obtain automatic transcriptions. The first is downloading subtitles from YouTube, the main source of audio and video material, when the video has uploader-supplied or automatically generated subtitles. The second is running dedicated speech recognition software. Neither method is perfect when used without manual correction, and both can misrecognize speech. They rely on advanced techniques, but they know nothing about the specific context of Marilyn Manson's work and history, so names and other context-specific terms may be transcribed incorrectly. As of today, no perfect software solution exists.

Pages transcribed this way are marked with a dedicated template. The template displays a notice explaining that, for technical reasons, the transcription may be imperfect, and such pages are also added to a dedicated category.

Using YouTube-generated or uploader-supplied subtitles

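To obtain subtitles from YouTube, you can use a tool such as yt-dlp. In addition to audio and video, yt-dlp can also download thumbnails and subtitles. The options in the example below work as follows:

* --write-subs requests uploader-supplied subtitles.
* --write-auto-subs requests automatically generated subtitles.
* --sub-langs=en limits the subtitles to English.
* --sub-format=ass/srt/best prefers ASS, then SRT, then the best available format.
* --extract-audio keeps only the audio track. This step requires ffmpeg.

Example usage: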

   yt-dlp --extract-audio --write-subs --write-auto-subs --sub-format=ass/srt/best --sub-langs=en [{<YOUTUBE_URL>|<YOUTUBE_VIDEO_ID>}...]
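Subtitle files contain cue numbers and timestamps that are not wanted in a wiki transcript. The following is a minimal sketch, not an official MansonWiki tool, of a POSIX shell function. The function name srt_to_text is hypothetical. It assumes the subtitles were saved in SRT format and prints only the spoken text. It also drops a line identical to the one before it, since automatically generated subtitles can repeat text across cues.

```shell
#!/bin/sh
# srt_to_text (hypothetical helper): print the spoken text of an SRT file.
# Assumes the usual SRT layout: a numeric cue index, a "start --> end"
# timestamp line, one or more text lines, then a blank separator line.
srt_to_text() {
    awk '
        { sub(/\r$/, "") }            # strip Windows line endings
        /^[0-9]+$/    { next }        # cue index (also drops digit-only text lines)
        /-->/         { next }        # timestamp line
        /^[ \t]*$/    { next }        # blank separator
        {
            gsub(/<[^>]*>/, "")       # remove simple tags such as <i>...</i>
            if ($0 == "" || $0 == prev) next   # skip empty or repeated lines
            print
            prev = $0
        }
    ' "$1"
}
```

For example, srt_to_text 'Interview [VIDEO_ID].en.srt' > transcript.txt. The file name here is only illustrative. A subtitle line consisting only of digits is discarded along with the cue numbers, so the result still needs proofreading.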

Using speech recognition software

When YouTube subtitles are missing or of poor quality, offline speech recognition can be tried instead. This requires a local copy of the audio, which the yt-dlp command above can provide. The recognition process can then be tuned to achieve better results, for example by choosing a larger model or specifying the language.

OpenAI has released Whisper, an open-source automatic speech recognition (ASR) tool. Speech recognition is computationally intensive, so Whisper runs best on capable hardware. The main requirements are a GPU and plenty of memory for the larger, more accurate models, which are selected with the --model option. It can also run on a CPU, but much more slowly. The --device option takes a PyTorch device name such as cuda or cpu. Example usage:

   whisper --device=cuda --language=en [<FILE>...]
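When many interviews have to be transcribed, it can help to process a whole folder at once. The sketch below is an illustration, not an established MansonWiki procedure. The function name transcribe_dir, the folder layout, the list of audio extensions and the choice of --model=medium are all assumptions.

For each audio file, the function runs whisper with --output_format=txt, so Whisper writes a plain-text transcript into the output folder. It skips files that already have a transcript, so an interrupted batch can simply be restarted. It also assumes that Whisper names each transcript after the input file without its extension (for example, a.mp3 becomes a.txt).

```shell
#!/bin/sh
# transcribe_dir (hypothetical helper): run Whisper on every audio file in
# a folder, writing .txt transcripts to a separate folder and skipping files
# that already have one. Assumes the openai-whisper CLI is on PATH.
transcribe_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.mp3 "$in_dir"/*.m4a "$in_dir"/*.opus "$in_dir"/*.wav; do
        [ -e "$f" ] || continue            # an unmatched glob stays literal; skip it
        base=$(basename "$f")
        name=${base%.*}                    # file name without its extension
        if [ -e "$out_dir/$name.txt" ]; then
            echo "skip: $base" >&2         # transcript already exists
            continue
        fi
        whisper --device=cuda --language=en --model=medium \
            --output_format=txt --output_dir="$out_dir" "$f"
    done
}
```

For example, run transcribe_dir ./audio ./transcripts. Replacing --model=medium with a smaller or larger model trades speed for accuracy.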