In our last post, we talked about how we use semantic tagging to sift through vast amounts of news and alternative media and surface relevant content for Alacra Pulse users.
An inherent challenge in semantic technology is accurately matching variations of a name. Whether the target is a person or a company, recognizing the myriad forms a name can take is not a trivial task for a computer.
Some name variations are easy. When searching for people, one can create or obtain lists of common nicknames – Bill or Billy for William and Peggy or Maggie for Margaret. But name variations can be much more complex. Surnames may precede given names in certain countries; people use nicknames that are not tied to their given name; news articles contain typos or may simply misprint someone’s name.
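As a rough illustration of the easy cases, here is a minimal Python sketch of nickname lookup, surname-first ordering and typo tolerance. The nickname table, the function names and the 0.8 similarity cutoff are all assumptions made for this example; they are not how Alacra Pulse actually works.

```python
from difflib import get_close_matches

# Hypothetical nickname table; a real list would be far larger.
NICKNAMES = {
    "bill": "william",
    "billy": "william",
    "peggy": "margaret",
    "maggie": "margaret",
}


def canonical_given_name(name):
    """Map a nickname to its formal given name, or return it lowercased."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def name_keys(full_name):
    """Candidate (given, surname) keys for a two-part name.

    Both orderings are produced so that "Gates, Bill" and "Bill Gates"
    share a key. Names with more or fewer than two parts are skipped.
    """
    parts = full_name.replace(",", " ").split()
    if len(parts) != 2:
        return set()
    a, b = (p.lower() for p in parts)
    return {(canonical_given_name(a), b), (canonical_given_name(b), a)}


def same_person(x, y):
    """True when two names share at least one candidate key."""
    return bool(name_keys(x) & name_keys(y))


def closest_known_name(name, known, cutoff=0.8):
    """Return the most similar known name, tolerating small typos."""
    matches = get_close_matches(name.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this table, "Bill Gates", "Gates, Bill" and "William Gates" all resolve to the same key. The approach also shows why the hard cases stay hard: a nickname missing from the table, or one unrelated to the given name, simply will not match.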
With companies, it can be even more challenging. Companies are frequently referred to by their acronyms (IBM or AIG), by tickers (GOOG or MSFT), by their brands (iPhone or Prius) or by familiar names (“Marks & Sparks” for Marks & Spencer). It’s also critical to understand corporate family relationships. When an article talks about the Wall Street Journal or MySpace, we need to “understand” that it is ultimately talking about News Corp.
When tagging content for Alacra Pulse, our starting point is to identify the companies being talked about. That means that whether a story mentions the iPhone, AAPL or Apple, we need to accurately tag Apple in that story.
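To make the idea concrete, here is a minimal Python sketch of alias-based tagging. The alias table, the `tag_companies` function and the matching rules (case-insensitive, whole-word, optional trailing "s" or "'s") are assumptions for illustration only, not a description of the production tagger.

```python
import re

# Hypothetical alias table: every known label maps to one company name.
ALIASES = {
    "iPhone": "Apple",
    "AAPL": "Apple",
    "Apple": "Apple",
    "IBM": "IBM",
    "Marks & Sparks": "Marks & Spencer",
    "Marks & Spencer": "Marks & Spencer",
}

# Case-insensitive lookup from alias to company.
_LOOKUP = {alias.lower(): company for alias, company in ALIASES.items()}

# Longest aliases first, so multi-word names win over shorter overlaps.
_alternatives = "|".join(
    re.escape(alias) for alias in sorted(ALIASES, key=len, reverse=True)
)

# Whole-word match, allowing a trailing plural or possessive ("iPhones").
_PATTERN = re.compile(
    r"(?<!\w)(" + _alternatives + r")(?:'s|s)?(?!\w)",
    re.IGNORECASE,
)


def tag_companies(text):
    """Return the set of companies whose aliases appear in the text."""
    return {_LOOKUP[m.group(1).lower()] for m in _PATTERN.finditer(text)}
```

A plain lookup like this will happily tag a story about apple orchards with Apple, which is one reason alias matching alone is not enough for accurate tagging.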
To ensure high levels of accuracy in our tagging, we rely upon the Alacra Concordance database. This database, which serves as the information management backbone for all Alacra products, is our master database of companies and identifiers. Tracking more than 400,000 companies, the Concordance database houses both public and proprietary identifiers, product and brand names, names of key executives and common nicknames for the companies. So when an analyst forecasts the number of iPhones to be sold or the declining user base at MySpace, we need to attribute those to Apple and News Corp.
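The sketch below shows, in Python, one way a master record with aliases and a parent link could roll a mention up to its corporate parent. The `Entity` class, its field names and the sample records are hypothetical; the actual Concordance schema is not described in this post. The Dow Jones record is included only to show a multi-level family tree.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Hypothetical master record: a name, an optional parent, and aliases."""
    name: str
    parent: Optional[str] = None
    aliases: set = field(default_factory=set)


# Illustrative sample records only.
ENTITIES = {
    e.name: e
    for e in [
        Entity("Apple", aliases={"AAPL", "iPhone"}),
        Entity("News Corp"),
        Entity("Dow Jones", parent="News Corp"),
        Entity("Wall Street Journal", parent="Dow Jones"),
        Entity("MySpace", parent="News Corp"),
    ]
}

# Index every name and alias (lowercased) to the entity that owns it.
ALIAS_INDEX = {}
for entity in ENTITIES.values():
    for label in {entity.name, *entity.aliases}:
        ALIAS_INDEX[label.lower()] = entity.name


def ultimate_parent(name):
    """Walk parent links to the top of the family tree, guarding against cycles."""
    seen = set()
    while ENTITIES[name].parent is not None and name not in seen:
        seen.add(name)
        name = ENTITIES[name].parent
    return name


def attribute(mention):
    """Resolve a mention to its top-level company, or None if unknown."""
    name = ALIAS_INDEX.get(mention.strip().lower())
    return ultimate_parent(name) if name else None
```

Here `attribute("iPhone")` resolves to Apple, and `attribute("MySpace")` resolves to News Corp by following the parent link.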
Understanding name variations, corporate family trees and company-product relationships is a critical step in finding relevant nuggets from the web. And it’s one of the key building blocks we use to generate quality results in Alacra Pulse.



Steve
A fascinating insight into the trials and tribulations of practical semantic applications. I was intrigued to see that your concordance database is key to matching text to business entities. The obvious next question is... how is the database updated - by people or technology?
Neil
Posted by: Neil Infield | November 09, 2009 at 09:34 AM
Thanks, Neil.
The knowledge base uses a mix of technology and people to keep it updated.
We leverage technology wherever practical, but there are critical points that require the judgment of human review. We believe the best solution combines technology, a priori knowledge and editorial oversight.
Posted by: Barry Graubart | December 03, 2009 at 02:43 PM