
Blind and low-vision people request descriptions of videos on YouDescribe, but only 7% of requests have been completed. AI is speeding up the process. Credit: Matthew Moderno/Northeastern University
For people who are blind or have low vision, audio descriptions of the action in movies and TV shows are essential to understanding what is happening. Networks and streaming services hire experts to create audio descriptions, but that is not the case for the billions of videos on YouTube and TikTok.
That doesn’t mean people don’t want to access that content.
Using AI vision language models (VLMs), researchers at Northeastern University have made audio descriptions available for user-generated videos through a crowdsourcing platform called YouDescribe. Much as they would at a library, blind and low-vision users can request descriptions of a video and then rate and contribute to the results.
“I understand that a 20-second TikTok video might not get a professional description,” says Lana Do, who earned her master’s in computer science from Northeastern’s Silicon Valley campus in May. “But blind and low-vision people might want to see the dance video too.”
In fact, the 2020 video for the song “Dynamite” by the Korean boy band BTS sits at the top of YouDescribe’s wish list, waiting to be described. The platform has 3,000 volunteer describers, but the wish list is so long that they can’t keep up. Only 7% of the videos requested on the wish list have an audio description, Do says.
Do works in the lab of Ilmi Yoon, a teaching professor of computer science on the Silicon Valley campus. Yoon joined the YouDescribe team in 2018 to develop the machine learning elements of the platform.
This year, the team added new features to speed up YouDescribe’s human-in-the-loop workflow. New VLM technology produces better-quality draft descriptions, and a new tool called Infobot lets users request more information about a particular video frame. Blind and low-vision users can even fix mistakes in a description through a collaborative editing interface, Do says.
As a result, video content is described better and becomes available more quickly. AI-generated drafts reduce the burden on human describers and let users easily engage in the process through ratings and comments, she says.
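In rough terms, such a human-in-the-loop pipeline might look like the sketch below. The data structure, function names, and the `vlm_describe` callable are hypothetical illustrations, not YouDescribe’s actual code:

```python
from dataclasses import dataclass, field

@dataclass
class DescriptionClip:
    start: float               # seconds into the video
    end: float
    text: str                  # narration spoken over this span
    source: str = "vlm"        # "vlm" draft or "human" edit
    ratings: list = field(default_factory=list)

def draft_descriptions(scenes, vlm_describe):
    """Produce first-pass descriptions with a VLM.

    `scenes` is a list of dicts with "start", "end", and key
    "frames"; `vlm_describe` stands in for whatever model the
    platform actually calls.
    """
    return [
        DescriptionClip(s["start"], s["end"], vlm_describe(s["frames"]))
        for s in scenes
    ]

def apply_human_edit(clip, new_text):
    """A volunteer or blind user corrects the AI draft."""
    clip.text = new_text
    clip.source = "human"

def rate(clip, stars):
    """Viewer ratings flag weak descriptions for review."""
    clip.ratings.append(stars)
```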
“They could say they were watching a documentary set in the woods and heard an unexplained flapping sound,” Do says.
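Under the hood, an Infobot-style request could pair a timestamp with a question and route the matching frame to the model. The sketch below is a guess at that shape; `ask_infobot` and `vlm_answer` are invented for illustration:

```python
def ask_infobot(frames, fps, timestamp, question, vlm_answer):
    """Answer a viewer's question about one moment in a video.

    frames: decoded video frames in playback order
    fps: frames per second, used to locate the right frame
    vlm_answer: placeholder for a vision-language model that takes
        (frame, question) and returns a short text answer
    """
    index = min(int(timestamp * fps), len(frames) - 1)
    return vlm_answer(frames[index], question)

# e.g., pausing a forest documentary at 72.5 seconds:
# ask_infobot(frames, 30, 72.5, "What is making the flapping sound?", model)
```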
Do and her colleagues recently published a paper on AI’s potential to accelerate the creation of audio descriptions at a symposium on human-computer interaction for work in Amsterdam. AI does an incredibly good job of describing human facial expressions and movements, Yoon says. In one video, for example, the AI agent described the steps a chef took while making cheese rolls.
But there are some consistent weaknesses, she says. AI is not good at reading facial expressions in manga, for example. And overall, humans are better at picking out the most important details in a scene, a skill that is essential for creating useful descriptions.
“It’s very labor-intensive,” says Yoon.
Graduate students in her lab compare AI-generated first drafts to the descriptions a human describer creates.
“Then we’ll measure the gap so that we can train our AI to do a better job,” she says. “Blind users don’t want to be distracted by too many verbal descriptions. It’s an editorial art to verbalize the most important information in a concise way.”
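One simple way to quantify such a gap is sketched below with Python’s standard library; the article does not describe the lab’s actual metrics, so treat this purely as an illustration:

```python
import difflib

def description_gap(ai_draft: str, human_reference: str) -> float:
    """Return a 0-1 dissimilarity score between an AI draft and a
    human-written description; higher means a bigger gap to close."""
    similarity = difflib.SequenceMatcher(
        None,
        ai_draft.lower().split(),
        human_reference.lower().split(),
    ).ratio()
    return 1.0 - similarity

# Example: a terse AI draft versus a describer's richer sentence.
print(description_gap(
    "A man stands in a kitchen holding a pan.",
    "The chef flips the cheese roll in a hot pan and smiles.",
))
```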
YouDescribe was launched in 2013 by the San Francisco-based Smith-Kettlewell Eye Research Institute, which trains sighted volunteers to create audio descriptions. Focusing on YouTube and TikTok videos, the platform offers tutorials on recording and timing narration so that user-generated video content becomes accessible.
Provided by Northeastern University
This story has been republished courtesy of Northeastern Global News news.northeastern.edu.
Citation: AI vision language model provides video descriptions for blind users (2025, June 30), retrieved July 1, 2025 from https://techxplore.com/news/2025-06-Ai-vision-language-video-descriptions.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
