Genre incubates

September 21, 2007 on 3:33 pm | In structured data web

OK, it is time to start creating a plan for releasing DataDig, a data search engine. The first step is to create a photo album of screen shots of what could be done with DataDig in data research and in shopping, two applications that illustrate the generality of DataDig’s engine. The second step is to host two small kick-the-tires examples, one in each of those application areas. The last step is to set up a way for folks to try out DataDig, either by submitting data to the hosted site or by downloading and installing DataDig (a more complicated step). Not all details are worked out yet, but there it is. At least the photo album should show up shortly.

Genre — the beginning

August 2, 2005 on 3:33 pm | In structured data web

Around 1997, when we were right in the middle of developing a distributed, peer-to-peer collaboration system at Crystaliz, Inc., I was intrigued by how search engines were taking off on the Web. There were Lycos, AltaVista, and Excite. One heard about two students at Stanford who were beginning to work on improving search results.

I felt that doing yet-another-search-engine was a waste of time (how wrong I turned out to be) and that there were two other types of information available on the Web that not very many people were paying attention to. The first was what my friend Sundar calls the hidden web: the data contained in the (typically) RDBMS-backed web sites that hawk goods. The second was the data available from the Government or other such data publishing entities. The second is what I will call the structured web.

So, what do I mean by structured web, and how is it different from Tim Berners-Lee’s semantic web? The semantic web originated with the belief that if only the world would adopt first order logic formalisms and extensions thereof to describe metadata, then programs could find the ‘meaning’ of data and hence operate on it automatically. With DARPA funding, this got extended to DAML, a knowledge representation language. I have not kept up with these developments and don’t know how many people are using it, but let us just say that it has not caught fire.

By structured web I mean structured data which ideally will have metadata describing its structure and which is used in different research tasks. Examples range from financial filings by companies to EPA data on Superfund sites to Bureau of Labor Statistics reports on census.

I was naive enough to think that if only one could build a ‘search’ engine for data, then one would find a lot of people using it to find information from the structured web. I also felt that data mining (which is an automated way of searching for patterns in data) could be improved if only humans were allowed to guide the mining process. Hence Genre was born.

We submitted a proposal to DARPA, and Ron Larsen, who was then the assistant director of the program office that was looking at digital library issues, funded the project. It was first called Genre, then Highd (for high dimensional data), and most recently DataDig.

So, where are we? The hidden web problem is what Adam Bosworth talks about, and what the folks at A9 (with OpenSearch) as well as the new extensions to RSS are beginning to tackle in a very, very limited way. Don’t get me wrong. Limited is good for a lot of Web applications used by all of us. If you are interested in sharing bookmarks, the list extension proposed by MSFT might be good. If you are interested in classic search engine results being merged correctly, the work being done by the folks at A9 is great.

I came from a different direction. I felt that just as search engines opened up the world of documents to folks who could access and learn from those documents, one needed a completely new kind of search engine to research data. The hidden web problem is different because the underlying data is often made opaque by user interfaces and, more importantly, it changes dynamically (you only need to look at Amazon to know this). So, the right way to solve the problem of ‘searching’ the hidden web turns into a ‘federation’ problem where one is setting up virtual views that may generate multiple queries to create the information presented to the user. I am also aware of other more interesting approaches to this problem being worked on in my neck of the woods, but, unfortunately, I am not at liberty to disclose them. Suffice it to say that the word ‘streams’ should give you a hint.
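To make the ‘federation’ idea a little more concrete, here is a minimal sketch in Python. It is not how Genre or DataDig works; the two bookstore sources, their made-up catalogs, and the names federated_view and SOURCES are all assumptions invented for illustration. The point is only that the view stores nothing itself: each request fans out into one query per source and the answers are merged on the fly.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
# A source is any callable that answers a keyword query with rows.
Source = Callable[[str], List[Row]]


def bookstore_a(keyword: str) -> List[Row]:
    # Hypothetical stand-in for a live query against one site's database.
    catalog = [
        {"title": "Data Mining Basics", "price": 30.0},
        {"title": "Gardening", "price": 12.5},
    ]
    return [r for r in catalog if keyword.lower() in str(r["title"]).lower()]


def bookstore_b(keyword: str) -> List[Row]:
    # A second hypothetical source with its own (made-up) catalog.
    catalog = [
        {"title": "Data Mining Basics", "price": 27.0},
        {"title": "Data Warehousing", "price": 45.0},
    ]
    return [r for r in catalog if keyword.lower() in str(r["title"]).lower()]


def federated_view(keyword: str, sources: Dict[str, Source]) -> List[Row]:
    """Query every source at request time and merge the answers."""
    merged: List[Row] = []
    for name, query in sources.items():
        for row in query(keyword):
            # Tag each row with where it came from.
            merged.append({**row, "source": name})
    # Present one combined result, cheapest first.
    return sorted(merged, key=lambda r: r["price"])


SOURCES = {"bookstore_a": bookstore_a, "bookstore_b": bookstore_b}
results = federated_view("data", SOURCES)
for row in results:
    print(row["source"], row["title"], row["price"])
```

Because nothing is cached, a price change at either source shows up the next time the view is queried, which is the property that matters when the underlying data changes dynamically.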

While we started by looking at the ‘hidden web’ problem (by crawling over ecommerce sites, travel sites, etc.), we ended up focusing on the ‘structured web’ problem. RDF was beginning to be defined and we found it to be too general and overly complex. DAML did not reduce the complexity, but increased it. And, here we were, wanting an easy way of getting at data that you typically find in very large spreadsheets or relational databases. No luck! There was no XML-based standard for publishing database dumps. So, we had to invent one. But, I am getting ahead of myself.

Earlier, I said that I was naive. What did I mean? Well, maybe it is obvious to you, but the problem of ‘data based research’ is a lot more complex than ‘text retrieval’ based research. While all the domains that ‘read’ documents can use text retrieval, data based research is different depending on the domain. Worse, much of the public data that is available from your taxpayer-supported government sources is so hard to understand and get at that there is a significant industry of suppliers who take the data published in raw ASCII or other such formats and create RDBMS databases to sell to users.

So, what is needed to solve this problem, so that one could build search-engine-like research tools for the structured web that use publicly available data to support the research needs that we as citizens, researchers, etc. may have?

IMHO, we need to solve the following problems:

a) We need a common format in which database dumps can be published. Before someone tells me that SQL scripts dumped from an RDBMS solve this problem, remember that much of the data I am talking about is not in an RDBMS.

b) We need a flexible search engine that can support (1) search, (2) analysis, and (3) mining. Maybe there is more that people would like to do with structured data, but this list about covers it for me. First, users find the data of interest in a vast collection (search); next, they look for patterns in the data using queries or visualization (analysis); and finally, they use the intuitions gained by ‘looking’ at those patterns to create models of the data (mining).

c) This may not be obvious, but if we want existing analysis applications (e.g., Excel) on the one hand, and web browser based or otherwise “rich client” applications on the other (whether these are made up of applets or AJAX based user interfaces), to work with the search engine in (b), we need an XML based protocol to support querying of the data over HTTP.
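To give a feel for how the three pieces in the list above could fit together, here is a rough sketch in Python. It is emphatically not the XML repository format or the protocol we built; the element names (dataset, schema, column, row, v, query, filter), the sample rows, and the helper functions are all assumptions made up for this illustration. It shows a self-describing dump (a), a search step and a simple analysis step over it (b), and a query expressed as an XML document that a client could send over HTTP (c). Mining is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical dump layout: <schema> describes each column,
# and each <row> carries one <v> per column, in schema order.
# The rows are made-up sample data.
DUMP = """
<dataset name="example_sites">
  <schema>
    <column name="region" type="string"/>
    <column name="site" type="string"/>
    <column name="cost" type="float"/>
  </schema>
  <rows>
    <row><v>north</v><v>Site A</v><v>1.5</v></row>
    <row><v>north</v><v>Site B</v><v>2.0</v></row>
    <row><v>south</v><v>Site C</v><v>4.0</v></row>
  </rows>
</dataset>
"""

CASTS = {"string": str, "float": float, "int": int}


def load_dump(xml_text):
    """Use the schema in the dump to turn each row into a typed dict."""
    root = ET.fromstring(xml_text.strip())
    columns = [(c.get("name"), CASTS[c.get("type")]) for c in root.find("schema")]
    rows = []
    for row in root.find("rows"):
        values = [v.text for v in row]
        rows.append({name: cast(text) for (name, cast), text in zip(columns, values)})
    return rows


def search(rows, **equals):
    """Search: keep the rows whose columns match the given values."""
    return [r for r in rows if all(r[k] == v for k, v in equals.items())]


def total_by(rows, key, value):
    """Analysis: sum one column, grouped by another."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0.0) + r[value]
    return totals


def build_query(dataset, filters):
    """Protocol: express a search as an XML request body."""
    q = ET.Element("query", dataset=dataset)
    for column, value in filters.items():
        ET.SubElement(q, "filter", column=column, equals=str(value))
    return ET.tostring(q, encoding="unicode")


rows = load_dump(DUMP)
north = search(rows, region="north")
totals = total_by(rows, "region", "cost")
request = build_query("example_sites", {"region": "north"})
print(north, totals, request, sep="\n")
```

Since the schema travels with the data, the same loader works whether the dump came from an RDBMS, a spreadsheet, or a raw ASCII file, and the request string could be POSTed as-is by a spreadsheet add-in or a browser client.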

Anyhow, that is what we have been working on. We will be releasing the XML repository format, the XML based protocol, and other related stuff soon. We are also trying to make money by selling the search engine we developed. More details to follow.
