By: Jessica Fishman
A news story aired regarding the economic impact of closing schools during the Covid-19 virus outbreak. Is this a story about the economy? Education? What about public health?
The answer is that it’s all of the above.
So how do you get an algorithm, a process that relies on hard math and data, to capture something nuanced and subjective? At Resonance AI we struggled with this exact question and ultimately arrived at a solution that both recognizes and accounts for the inherent subjectivity of news stories.
First, we picked fifteen categories that together would capture the vast majority of news coverage. We chose them by consulting professionals within the news industry, many of whom are part of the Resonance AI team, and landed on categories that are ubiquitous throughout the industry.
Next, we created definitions for each of these categories so that our team and our customers would be aware of what each meant. For example, is a story classified as “Accident” referring specifically to traffic accidents? What if someone were injured during a hike? Creating these definitions was a significant challenge and required further research to confirm our definitions aligned with industry expectations.
Now came the tricky part: creating an AI algorithm that would detect these subjects and group them into stories. But first, we had to create a corpus, a set of data that would train our AI to recognize patterns associated with each type of subject. For this, we gathered hundreds of news stories from news sources around the country, using a television archive database. At first, we gave each story a single classification, but we quickly realized that this would not be sufficient: news stories are simply too subjective.
So we began to apply a primary subject, as well as secondary and tertiary subjects, to each story. This allowed us to capture some of that subjectivity within the data that would be training our AI. After capturing and categorizing hundreds of stories, we used this data to train an AI model that could recognize and categorize stories on its own. Our subject corpus also continues to grow.
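To make the idea of primary, secondary, and tertiary subjects concrete, here is a minimal Python sketch of how ranked labels could be turned into a weighted training target. The rank weights (0.6, 0.3, 0.1), the function name, and the short category list are all hypothetical and chosen only for illustration; they are not the values or code used in our model.

```python
# Hypothetical weights for primary, secondary, and tertiary subjects.
RANK_WEIGHTS = (0.6, 0.3, 0.1)


def soft_target(ranked_subjects, categories):
    """Convert one to three ranked subjects into weights that sum to 1."""
    if not 1 <= len(ranked_subjects) <= len(RANK_WEIGHTS):
        raise ValueError("expected between one and three ranked subjects")

    # Use only as many weights as there are labels, then renormalize
    # so a story with fewer than three subjects still sums to 1.
    weights = RANK_WEIGHTS[: len(ranked_subjects)]
    total = sum(weights)

    target = {category: 0.0 for category in categories}
    for subject, weight in zip(ranked_subjects, weights):
        if subject not in target:
            raise ValueError(f"unknown category: {subject}")
        target[subject] = weight / total
    return target


# Illustrative subset of categories, not the full list of fifteen.
categories = ["Economy and Business", "Public Health", "Education", "Accident"]
target = soft_target(["Economy and Business", "Public Health"], categories)
```

A story labeled with only two subjects still gets a target that sums to one, with the primary subject weighted most heavily and every unlabeled category at zero.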
Just as the training data allowed more than one subject per story, the algorithm itself outputs more than one result. A story about Covid-19 and its impact on the economy doesn't need to be placed in either public health or economy and business; instead, the story could be 65% economy and 35% public health.
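As a rough illustration of a percentage-style output, the Python sketch below turns raw per-subject scores into shares that sum to one. It uses a softmax, a common way to do this, and drops subjects with a negligible share. The function name, the input scores, and the 5% cutoff are assumptions made for this example, not a description of our production system.

```python
import math


def subject_mix(scores, min_share=0.05):
    """Turn raw per-subject scores into shares that sum to 1."""
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {subject: math.exp(value - top) for subject, value in scores.items()}
    total = sum(exps.values())
    shares = {subject: value / total for subject, value in exps.items()}

    # Drop subjects with a negligible share, then renormalize the rest.
    kept = {s: p for s, p in shares.items() if p >= min_share}
    if not kept:
        kept = shares
    kept_total = sum(kept.values())

    # Return the remaining subjects ordered from largest to smallest share.
    ordered = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return {subject: share / kept_total for subject, share in ordered}


# Hypothetical raw scores for a single story.
mix = subject_mix(
    {"Economy and Business": 2.0, "Public Health": 1.38, "Accident": -3.0}
)
for subject, share in mix.items():
    print(f"{subject}: {share:.0%}")
```

With these made-up scores, the story comes out mostly economy with a substantial public health share, while the unrelated category is dropped.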
It’s also important to point out that a subject and a story are two different things. A full newscast can have many subjects identified, but distinguishing one cohesive story from another is a different problem. Our data scientists used a methodology that compared the subject classification of each scene to that of the scene before it and the scene after it, in order to determine whether it was likely part of the same story.
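The scene comparison can be sketched in a few lines of Python. This simplified version measures how much the subject mix of each scene overlaps with that of the scene before it, and starts a new story when the overlap falls below a threshold. Because every adjacent pair is checked, each scene ends up compared with both of its neighbors. The overlap measure, the 0.5 threshold, and the function names are assumptions for illustration only, not the methodology our data scientists actually used.

```python
def overlap(a, b):
    """Shared probability mass between two subject mixes (0 to 1)."""
    subjects = set(a) | set(b)
    return sum(min(a.get(s, 0.0), b.get(s, 0.0)) for s in subjects)


def group_scenes(scenes, threshold=0.5):
    """Group consecutive scene indices into stories by subject overlap."""
    stories = []
    for index, scene in enumerate(scenes):
        if index > 0 and overlap(scenes[index - 1], scene) >= threshold:
            # Similar enough to the previous scene: same story.
            stories[-1].append(index)
        else:
            # First scene, or a clear change of subject: new story.
            stories.append([index])
    return stories


# Hypothetical subject mixes for four consecutive scenes.
scenes = [
    {"Economy and Business": 0.7, "Public Health": 0.3},
    {"Economy and Business": 0.6, "Public Health": 0.4},
    {"Accident": 0.9, "Public Health": 0.1},
    {"Accident": 1.0},
]
stories = group_scenes(scenes)
```

Here the first two scenes share most of their subject mix and are grouped together, while the shift to accident coverage in the third scene opens a new story.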
So why does this matter? When using AI to analyze anything, it’s incredibly important that the model has been trained on the specific subject matter it’s intended to analyze. General-purpose AI won’t be able to capture the nuances and intricacies of local news if it wasn’t trained specifically on local news. Working with industry experts to understand how technology should view the news was key. There are plenty of video analysis tools available, but unless they are trained on a specific video type, they cannot deliver the insights needed by the industry they’re trying to help.