Generating Video Subtitles Using Speech Recognition: A Trial on Selected VTV Programs

Hữu Phong Nguyễn; Nguyễn Quốc Bảo Võ; Minh Trung Trần

doi:10.54644/jte.71B.2022.1128

Authors

Hữu Phong Nguyễn, Vietnam Television (VTV), Vietnam
Nguyễn Quốc Bảo Võ, Posts and Telecommunications Institute of Technology (PTIT), Ho Chi Minh City Campus, Vietnam
Minh Trung Trần, Vietnam Television (VTV), Vietnam

Corresponding author email:

phongsolo@gmail.com

DOI:

https://doi.org/10.54644/jte.71B.2022.1128

Keywords:

Speech recognition, Word error rate, Video on demand, OTT services, Closed captions

Abstract

This paper presents the results of testing a Speech-To-Text (STT) speech recognition tool on Video On Demand (VOD) content from the VTVgo system of Vietnam Television (VTV). The accuracy of the STT tool is evaluated with the word error rate (WER), a standard metric for measuring the performance of automatic speech recognition and machine translation systems. The tests covered 10 different genres of television programs, totaling 1065 hours of video. The lowest WER values, from 2.8% to 4.3%, were obtained for news, current affairs, and weather forecast programs, in which most speakers and presenters (MCs) read in a standard accent in the studio, the speech comes from a single speaker, and there is little interference from external noise. In addition, to illustrate the application of video subtitles, we ran a trial on the VTVgo system by integrating an optional subtitle display tool into the VTVgo app. The test platforms were Android SmartTV and smartphone devices, chosen to demonstrate the feasibility of video subtitles on an OTT (Over The Top) digital content distribution platform.
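To make the WER metric mentioned in the abstract concrete, the following is a minimal sketch of one common way to compute it, assuming whitespace tokenization and the usual definition WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference transcript. The function name `word_error_rate` and the sample sentences are illustrative assumptions; this is not the evaluation tool used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level Levenshtein distance.

    Assumption: words are separated by whitespace, and no text
    normalization (case folding, punctuation removal) is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted word out of six.
    reference = "dự báo thời tiết hôm nay"
    hypothesis = "dự báo thời tiếc hôm nay"
    print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

Because the denominator is the length of the reference, WER can exceed 100% when the hypothesis contains many inserted words. In practice, results also depend on how the transcripts are normalized before comparison.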

Author Biographies

Hữu Phong Nguyễn, Vietnam Television (VTV), Vietnam

Phong Nguyen-Huu received the B.E. degree in Telecommunications Engineering from the University of Transport and Communications, Campus 2 (UTC2), Vietnam, in 2006 and the Master of Telecom degree from the HCMC Posts and Telecommunications Institute of Technology (PTIT), Vietnam, in 2014. Since August 2016, he has been working toward the Ph.D. degree in the Faculty of Telecommunications, Ho Chi Minh City University of Technology (HCMUT). He currently works for Vietnam Television (VTV). His research interests include mobile communication networks (two-way communications, full-duplex transmission), energy harvesting, audio/video coding, and broadcast technology.

Nguyễn Quốc Bảo Võ, Posts and Telecommunications Institute of Technology (PTIT), Ho Chi Minh City Campus, Vietnam

Vo Nguyen Quoc Bao received the Ph.D. degree in electrical engineering from the University of Ulsan, South Korea, in 2010. Dr. Bao is an associate professor of Wireless Communications at the Posts and Telecommunications Institute of Technology (PTIT), Vietnam, where he currently serves as Director of the Wireless Communication Laboratory (WCOMM). He is a senior member of IEEE. He is the Technical Editor-in-Chief of the REV Journal on Electronics and Communications and also serves as an Editor of Transactions on Emerging Telecommunications Technologies (Wiley ETT) and the VNU Journal of Computer Science and Communication Engineering. He served as a Technical Program co-chair for ATC (2013, 2014), NAFOSTED-NICS (2014, 2015, 2016), REV-ECIT 2015, ComManTel (2014, 2015), and SigComTel 2017. His research interests include wireless communications and information theory, with current emphasis on MIMO systems, cooperative and cognitive communications, physical layer security, and energy harvesting.

Minh Trung Trần, Vietnam Television (VTV), Vietnam

Tran Minh Trung received his M.Eng. degree in Bachelor of Science from the University of Natural Sciences, Vietnam, in 1998. He currently works for Vietnam Television in the southern region. He is interested in television technology and its applications in everyday life.
