Recently I remembered an old idea I had about the ranking of XML documents. At the time, I even wrote to my professor, P. Gallinari, to ask what he thought of it. He seemed open to discussing it, but after a while I forgot about it and never developed the idea. Feel free to use it if you think it is interesting and not already in use…
Three years ago I developed a simple search engine to index and search XML documents (metadata documents). The search engine used Lucene, an open source Java library that handles the core of the search engine. With Lucene, you can associate a weight with a specific part of the document to “boost” the ranking of a document when the terms of a full-text query are found in that part (e.g. the document’s title). So if you have a document with different parts that interest different types of users (e.g. a report with a Technical part and an Executive part), you can change the values of the weights (0.0 = ignore, 1.0 = normal, N = more important) according to the interests of the user and get a personalized ranking model. An article that briefly covered this work was published in a French e-learning journal.
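To make the weighting idea concrete, here is a minimal sketch in plain Python. It is not Lucene's API and not the code of the original search engine: the function `field_score`, the sample `report` and the two user profiles are all made up for illustration, and the score is a bare term count rather than a real relevance formula.

```python
from collections import Counter

def field_score(doc, query_terms, weights):
    """Sum the query-term counts of each field, scaled by the user's weight for that field."""
    score = 0.0
    for field, text in doc.items():
        weight = weights.get(field, 1.0)  # fields without a weight count as 1.0 (normal)
        counts = Counter(text.lower().split())
        score += weight * sum(counts[term.lower()] for term in query_terms)
    return score

# Hypothetical report with a technical part and an executive part.
report = {
    "title": "Search engine report",
    "technical": "the index uses an inverted index over terms",
    "executive": "the budget covers the index servers",
}

# Hypothetical user profiles: 0.0 = ignore, 1.0 = normal, N = more important.
engineer = {"technical": 2.0, "executive": 0.0}
manager = {"technical": 0.0, "executive": 2.0}

print(field_score(report, ["index"], engineer))
print(field_score(report, ["index"], manager))
```

The same document gets a different score for the same query depending on the profile, which is all the personalization amounts to in this sketch.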
One year later, I went back to the search engine for fun, and more generally to ranking for full-text search on XML documents, which was a common problem in the Information Retrieval field because of the rise of XML in industry. I read some articles on the subject. All of them took into account a subset of the following information to build their ranking model:
- statistics on the terms (like TF-IDF)
- semantics of the terms (extraction of the semantic context)
- structure of the XML schema
- statistics on the XML tags
The semantic analysis of the tags
Although some ranking models are very well thought out, none of them used the semantic information of the XML tags. In an XML document, you have different types of tags; for example, in the XHTML schema, you have:
- metadata (<keyword>, <author>, <year>, …)
- structural (<title>, <p>, <div>, <ul>, …)
- reference (<a>) (may be included in the structural markup)
- layout, emphasis (<em>, <strong>, <u>, …)
NB: The XML attribute “xml:lang” may also be interesting because of its specific semantics, and hence its relevance to the ranking of a collection of multilingual XML documents.
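As a rough sketch of what “using the semantics of the tags” could mean in practice, the snippet below maps tag names to the categories listed above and reports which categories enclose a given term. The mapping `TAG_CATEGORY` and the function `categories_for_term` are assumptions made for illustration (I added <i> to the layout tags); a real system would need a mapping per schema.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from tag name to semantic category (illustrative, not exhaustive).
TAG_CATEGORY = {
    "keyword": "metadata", "author": "metadata", "year": "metadata",
    "title": "structural", "p": "structural", "div": "structural", "ul": "structural",
    "a": "reference",
    "em": "layout", "strong": "layout", "u": "layout", "i": "layout",
}

def categories_for_term(xml_text, term):
    """Return the set of tag categories that enclose at least one occurrence of `term`."""
    found = set()

    def walk(elem, enclosing):
        enclosing = enclosing | {TAG_CATEGORY.get(elem.tag, "unknown")}
        if term in (elem.text or "").split():
            found.update(enclosing)
        for child in elem:
            walk(child, enclosing)
            # A child's tail text belongs to the current element, not to the child.
            if term in (child.tail or "").split():
                found.update(enclosing)

    walk(ET.fromstring(xml_text), frozenset())
    return found

sample = "<div><p>bla <em>term</em> bla</p><author>term</author></div>"
print(sorted(categories_for_term(sample, "term")))
```

A ranking model could then weight each occurrence of a term according to the categories returned here instead of treating all tags alike.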
Maybe it is not really useful. But why not use all the information you have about the document to improve the accuracy of the ranking model? Adding this information to the ranking model could be useful in some cases.
In XML ranking, a simple and common idea for ranking documents could be: the deeper the terms are found in the tree structure, the less important they are for the ranking (e.g. a document with a term in its title is a priori more relevant than another document with the same term in a sub-subsection). Some papers base their ranking formula on this idea. But the idea only holds for structural tags, not for layout tags. Without modification of the ranking model, both of these two examples have the same rank because they have the same depth from the root:
The problem is that the <i> tag is not a structural tag, so doc1 should be more relevant than doc2 because “term” is structurally closer to doc1’s root. The notion of depth is only relevant for structural tags.
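Here is a small sketch of the difference between raw depth and structural depth. The two sample documents, the set `LAYOUT_TAGS` and the function `term_depths` are my own assumptions for illustration, not taken from any of the papers: the function simply skips layout tags when counting how deep a term sits.

```python
import xml.etree.ElementTree as ET

LAYOUT_TAGS = {"i", "em", "strong", "u"}  # assumed set of non-structural tags

def term_depths(xml_text, term):
    """Return one (raw_depth, structural_depth) pair per text node containing `term`."""
    hits = []

    def visit(elem, raw, structural):
        raw += 1
        if elem.tag not in LAYOUT_TAGS:
            structural += 1  # layout tags do not add structural depth
        if term in (elem.text or "").split():
            hits.append((raw, structural))
        for child in elem:
            visit(child, raw, structural)
            # A child's tail text sits in the current element, not in the child.
            if term in (child.tail or "").split():
                hits.append((raw, structural))

    visit(ET.fromstring(xml_text), 0, 0)
    return hits

# Made-up documents: the term is three tags deep in both of them.
doc1 = "<doc><p><i>term</i></p></doc>"
doc2 = "<doc><section><p>term</p></section></doc>"

print(term_depths(doc1, "term"))
print(term_depths(doc2, "term"))
```

With raw depth the two documents tie, while with structural depth the term in doc1 is one level closer to the root, which is the behaviour argued for above.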
If the author uses emphasis tags, it is for a reason: he wants us to pay attention to the emphasized words. So we can consider that an emphasized word is a priori more important than a normal word. However, with the same notion of depth as in Example 1, the classical ranking model does not work:
doc1: <p>bla bla <i>term</i> bla bla </p>
doc2: <p>bla bla term bla bla</p>
In a classical ranking model that does not take the semantics of the tags into account, if I search for “term”, doc2 would be the first result, because in doc2 “term” is less structurally deep than in doc1. This is not the desired outcome, because in doc1 “term” is emphasized and therefore important for doc1. Thus doc1 should be ranked first…
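A minimal sketch of a score that behaves as wanted on these two documents: emphasis tags do not add depth, and a term inside one gets a boost. The function `emphasis_score`, the tag set and the value of `EMPHASIS_BOOST` are assumptions chosen for illustration, and the formula (boost divided by structural depth) is only one possible choice.

```python
import xml.etree.ElementTree as ET

EMPHASIS_TAGS = {"i", "em", "strong", "u"}  # assumed set of emphasis tags
EMPHASIS_BOOST = 2.0                        # hypothetical boost for emphasized terms

def emphasis_score(xml_text, term):
    """Score a document: each occurrence of `term` counts boost / structural depth."""
    total = 0.0

    def visit(elem, structural, emphasized):
        nonlocal total
        if elem.tag in EMPHASIS_TAGS:
            emphasized = True   # emphasis tags boost, but add no depth
        else:
            structural += 1
        boost = EMPHASIS_BOOST if emphasized else 1.0
        weight = boost / max(structural, 1)  # guard against an emphasis tag at the root
        total += weight * (elem.text or "").split().count(term)
        for child in elem:
            visit(child, structural, emphasized)
            # A child's tail text belongs to the current element.
            total += weight * (child.tail or "").split().count(term)

    visit(ET.fromstring(xml_text), 0, False)
    return total

doc1 = "<p>bla bla <i>term</i> bla bla </p>"
doc2 = "<p>bla bla term bla bla</p>"

print(emphasis_score(doc1, "term"))
print(emphasis_score(doc2, "term"))
```

Here doc1 scores higher than doc2, whereas a score based on raw depth alone would put doc2 first.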
Knowing that a term is metadata (nested in a metadata tag) should also play a role in the ranking, so that such a document is ranked differently from a document that has the same term in its body.