Topix.net – world’s largest news Website – launches with 150K categories

The volume of news and information online is so vast that it can be unmanageable. This is precisely why news aggregation and search Websites like Yahoo News and Google News are growing so quickly. Last month marked the launch of a new entry – Topix.net, said to be the world’s largest online news Website, with over 150,000 news categories and news sources from around the world.

Rich Skrenta, CEO of Topix.net and co-founder of Netscape’s Open Directory Project, took a few minutes out of his news day to explain his huge online news creation.

Q: What does your site do and how does it work?

Skrenta: Topix.net is constantly reading all of the news published online and categorizing stories, both geographically and by subject. We’ve got a basic news roll-up for every zip code in the United States – for 30,000 towns across the U.S. We also have 150,000 subject categories. We have a page about every sports team, about every celebrity, every music style. We even have a page about mobile home manufacturing, which has a surprisingly large amount of news on it.

Q: You can’t fit 30,000 zip codes on your front page. Does the cream of the crop rise to the top so that when you go to your home page you see some of the major hot topics?

Skrenta: We have links to the major cities in the country and users can type in zip codes and go to a page just for news about their town. We’ve got some of the categories surfaced on our front page: U.S. and world news, journalism news, health, technology – basically a random assortment of categories from deep within our system. But to get the full experience of Topix, you have to click around and experience the full breadth by viewing the internal parts of the site.

Q: There are a lot of people saying you’re the largest news Website that’s ever been created. Would that be a true statement?

Skrenta: Based on the number of categories, yes. If you look at Google News, they’ve got eight categories: health, politics, entertainment, sports and so on, basically corresponding to the standard Associated Press taxonomy. Yahoo News has 100 full coverage sections. We have 150,000 pages. Our goal is to have a page constantly updated from the broadest variety of sources about every person, place and thing in the world. We haven’t done every person, place or thing yet, but we’ve done the first 150,000. We’ve got a page about every public company. There are 5,500 public companies. We’re tracking references to every disease and drug – both brand and generic – 21,000 sports personalities, 45,000 celebrities, anyone who’s ever been in a movie, anyone who’s ever put out a music album.

Q: Building out your keyword database – you must have spent a lot of long nights working.

Skrenta: Yeah – it’s a massive knowledge base which drives our system in conjunction with some artificial intelligence that we developed. The knowledge base knows the name of every street in the country, every bridge, tunnel, hospital, school, body of water, baseball stadium, park – in addition to the other subjects and keywords it’s looking for. It’s about 10 million lines of text that are constantly being looked for in every story that comes through our system.
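
To make the idea of checking every story against a large phrase list concrete, here is a minimal sketch in Python. It is not Topix.net's code, which the interview does not describe; the three-entry `KNOWLEDGE_BASE`, its topic names and the `find_topics` function are hypothetical, and a real system with millions of phrases would likely need a more efficient matching structure.

```python
import re

# Hypothetical miniature knowledge base: phrase -> topic page.
# The real one is described as roughly 10 million lines of text.
KNOWLEDGE_BASE = {
    "golden gate bridge": "San Francisco, CA",
    "ibm": "IBM",
    "mobile home": "Mobile Home Manufacturing",
}

# Longest phrase length in words, so we know how far to look ahead.
MAX_WORDS = max(len(phrase.split()) for phrase in KNOWLEDGE_BASE)


def find_topics(story):
    """Return (phrase, topic) pairs for every known phrase found in the story."""
    tokens = re.findall(r"[a-z0-9]+", story.lower())
    hits = []
    for start in range(len(tokens)):
        for length in range(1, MAX_WORDS + 1):
            if start + length > len(tokens):
                break  # ran off the end of the story
            phrase = " ".join(tokens[start:start + length])
            if phrase in KNOWLEDGE_BASE:
                hits.append((phrase, KNOWLEDGE_BASE[phrase]))
    return hits


story = "Traffic stalled on the Golden Gate Bridge as IBM announced new servers."
for phrase, topic in find_topics(story):
    print(phrase, "->", topic)
```

Looking phrases up in a dictionary keeps the cost per story tied to the story's length rather than to the size of the phrase list, which is one plausible way to cope with a list this large.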

Q: What’s the difference between what Topix.net is doing and what Google is doing in terms of the broad picture of what you’re actually indexing?

Skrenta: If you go to Google News and want to get information about, say, IBM, you’d find a lot of stories that contained the three letters “IBM,” but it might not be a good relevant overview of IBM’s current business. When we look at a story, we’re trying to determine not just if a story contains certain keywords, but actually if it’s about the concept that our topic is about.
A story we recently saw said, “Dot-com survivors have aged like fine bordeaux.” Now this is a reference to a style of wine and we have a wine page in Topix, but it’s not a wine story. It’s a business story. It’s a stray reference to something else. If you search for bordeaux on Google News, you would get this story. But it’s not what you’re looking for. Our system can tell the difference between stray references to concepts and stories that are actually about the concept.
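
The difference between a stray reference and a story that is actually about a topic can be illustrated with a toy heuristic. The sketch below is only an assumption for illustration, not the proprietary method Skrenta describes: it measures what share of a story's words belong to a hypothetical wine vocabulary (`WINE_TERMS`) and compares that with a plain keyword match. The threshold of 0.1 and both sample stories are invented.

```python
import re

# Hypothetical vocabulary for a "wine" topic page.
WINE_TERMS = {"wine", "bordeaux", "vineyard", "vintage", "grapes", "winemaker"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(story, keyword):
    """Plain keyword search: true if the word appears anywhere."""
    return keyword in tokenize(story)


def topic_share(story, topic_terms):
    """Fraction of the story's words that belong to the topic vocabulary."""
    words = tokenize(story)
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def is_about(story, topic_terms, threshold=0.1):
    """Assumed rule of thumb: a story is 'about' a topic when enough
    of its words come from that topic's vocabulary."""
    return topic_share(story, topic_terms) >= threshold


business_story = (
    "Dot-com survivors have aged like fine bordeaux. Shares of the surviving "
    "companies rose as profits and revenue grew this quarter."
)
wine_story = (
    "The Bordeaux vintage impressed critics. Winemakers say the grapes "
    "ripened early and the wine shows depth."
)

for story in (business_story, wine_story):
    print(keyword_match(story, "bordeaux"), is_about(story, WINE_TERMS))
```

Both stories match the keyword "bordeaux", but only the second clears the vocabulary-share threshold, which is the kind of distinction the answer above describes.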

Q: You come from an incredible background of creating things we now take for granted on the Web. Have you now created this artificial intelligence (AI) that’s just proprietary to Topix that no one else has duplicated?

Skrenta: Yeah, it’s pretty unique in the industry. I haven’t seen anything else like it. We looked at what had been done in academic AI. If you’re going to develop an AI technique, 85 percent accuracy is pretty good. But for our purposes, we had to get far above 85 percent to make the stories look good on a page. If our AI was only 85 percent accurate, that would mean that on every one of our pages, 2-3 stories would be bad. We had to get way above 99 percent.
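
The arithmetic behind that claim is easy to check. The interview does not say how many stories appear on a Topix.net page, so the page sizes below (15 and 20 stories) are assumptions chosen because they reproduce the "2-3 stories" figure; `expected_bad_stories` is a hypothetical helper.

```python
def expected_bad_stories(accuracy, stories_per_page):
    """Average number of misclassified stories on a page, assuming each
    story is classified correctly with the given probability."""
    return (1.0 - accuracy) * stories_per_page


# Assumed page sizes; the interview does not give the real number.
for stories_per_page in (15, 20):
    for accuracy in (0.85, 0.99):
        bad = expected_bad_stories(accuracy, stories_per_page)
        print(f"{stories_per_page} stories at {accuracy:.0%} accuracy: "
              f"about {bad:.2f} bad per page")
```

Under these assumptions, 85 percent accuracy leaves two to three bad stories on every page, while 99 percent leaves roughly one bad story for every five to seven pages.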

Q: It’s a pleasing site. You don’t get just a list of headlines; it’s formatted beautifully and very well organized. It’s a pleasure to read and it’s exciting.

Skrenta: I’m glad to hear you say that. When we looked at creating a look and feel for the site, we looked at a bunch of newspaper sites out there. I didn’t really feel like a lot of them looked like a newspaper. The Wall Street Journal Online is the one that was closest to a newspaper look and feel.
We did some research and found that newspaper layout design is actually a rich field with a 150-year history and there are books and guidelines about rules to follow to make things visually appealing in a print newspaper. Things like if you have a photo with a story and the photo is a picture of a person in profile, the person should be facing the text of the story; very subtle, not obvious rules about how to do newspaper layout properly. When we looked at online newspapers, many didn’t follow any of these rules at all. I couldn’t figure out why – maybe the separation between the print and online divisions at the company. We thought we’d bring some of these rules to bear on a Website design, adapt them to the Web and come up with something a little more reminiscent of a newspaper.

Q: You were one of the founders of the Open Directory Project. What did you learn from building the Open Directory that you’ve applied to this new site?

Skrenta: The Open Directory was built with 60,000 volunteer editors. We built a giant Web directory similar to Yahoo, but 3-4 times bigger than Yahoo Directory. It’s actually the directory tab on Google.com, in addition to being used by AOL and Netscape. It was a very successful project, but we sort of took the opposite tack with Topix.net. We have zero human editors at Topix – it’s all done with AI. Humans are really good at some tasks, but the scale of what we’re doing here is just so vast that we couldn’t have humans manually editing stories or selecting topics or categorizing them. It’s too big a project for even 60,000 people to undertake.

Q: What’s fascinating is that you seem to be building a kind of dynamically populated directory. It looks like you still have that focus on categorizing content, which is different from a regular search engine. Was your vision to create a directory-type search engine?

Skrenta: What we’re trying to do is classify text by concept instead of keyword. When you go to Google and type in a name like Scott Peterson or Janet Jackson, these are actually relatively common names. There are thousands of people in the country with those names. Some of them make it into the media, besides the ones we commonly think of. We wanted a system that could be intelligent enough to decide a document was actually about that concept as opposed to being a strict keyword match.

The full audio interview with Rich Skrenta can be heard at WebTalkRadio.com.
Dana Greenlee is co-host/producer of the WebTalkGuys Radio Show, a Tacoma-based nationally syndicated radio and webcast show featuring technology news and interviews.