Training Corpus

The training corpus is composed of:

  • A category file that gives the set of documents considered in this track and the categories of the training documents
  • A link file that gives the links between the documents
  • A content file that contains the normalized tf-idf vectors computed by the organizers over this collection

The three files can be downloaded from http://www-connex.lip6.fr/~denoyer/inex2009/corpus_train.tar.gz (about 300 MB).

Participants can use either the provided tf-idf vectors or the original documents from the INEX 2009 Collection.

Category file

Each line corresponds to one document and has the form:

  id_of_the_document,category1,category2,.....

where

  • id_of_the_document is the id of the document in the original INEX 2009 collection
  • each category is a string (corresponding to one of the 40 Wikipedia portals kept in this corpus)
  • the ?? category marks the corresponding document as a test document. Participants must provide a score for each of these '??' documents for every possible category.
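
As an illustration, one line of the category file could be parsed as sketched below. This is only a minimal Python sketch, not part of the track's tooling: the function name parse_category_line and the category names in the usage lines are hypothetical, ids are kept as strings, and it assumes that category names contain no commas.

```python
def parse_category_line(line):
    """Split one category-file line into (doc_id, categories, is_test)."""
    fields = [field.strip() for field in line.strip().split(",")]
    doc_id = fields[0]
    # Drop empty fields that a trailing comma would produce.
    categories = [c for c in fields[1:] if c]
    # A '??' category marks a test document whose scores must be predicted.
    is_test = "??" in categories
    return doc_id, categories, is_test


# Hypothetical lines; the real category names come from the corpus itself.
print(parse_category_line("12,Art,Music"))
print(parse_category_line("34,??"))
```

Keeping the '??' marker in the returned list and exposing a separate boolean makes it easy to split the file into training and test documents in a single pass.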

Link file

Each line corresponds to one document. The structure is:

   id_of_the_document, link_to1, link_to2, ....

where

  • id_of_the_document is the id of the document in the original INEX 2009 collection
  • each following column corresponds to the destination of a link. For example, the line
    53,45,48,100

means that there is a link from doc 53 to doc 45, from doc 53 to doc 48, and from doc 53 to doc 100.
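
The link file can be read into an adjacency list along the following lines. This is a hedged sketch under stated assumptions: the names parse_link_line and build_adjacency are hypothetical, ids are kept as strings, and whitespace around commas is tolerated because the format line above shows spaces after the separators.

```python
def parse_link_line(line):
    """Return (source_id, [target_id, ...]) for one link-file line."""
    fields = [field.strip() for field in line.strip().split(",")]
    source = fields[0]
    # Ignore empty fields left by a trailing comma.
    targets = [t for t in fields[1:] if t]
    return source, targets


def build_adjacency(lines):
    """Map each source document id to the list of documents it links to."""
    adjacency = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, targets = parse_link_line(line)
        adjacency.setdefault(source, []).extend(targets)
    return adjacency


# The example line from the description: doc 53 links to docs 45, 48 and 100.
print(build_adjacency(["53,45,48,100"]))
```

A document with no outgoing links still appears as a key with an empty list, so the set of keys doubles as the list of documents seen in the file.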

Content file

Each line of the content file is structured as follows:

 id [feature:values]+

where

  • id is the id of the document
  • the following columns correspond to a normalized tf-idf vector in the svmlight format
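
A line of the content file could be turned into a sparse vector roughly as follows. This is a minimal sketch, not the organizers' code: the name parse_content_line is hypothetical, the feature indices are assumed to be integers (the usual svmlight convention), and the values in the usage line are made up for illustration.

```python
def parse_content_line(line):
    """Return (doc_id, {feature_index: weight}) for one content-file line."""
    tokens = line.split()
    # The first token is the document id, not a class label.
    doc_id = tokens[0]
    vector = {}
    for token in tokens[1:]:
        feature, value = token.split(":", 1)
        # Assumption: integer feature indices and floating-point weights.
        vector[int(feature)] = float(value)
    return doc_id, vector


# Hypothetical line: document 53 with two non-zero tf-idf weights.
doc_id, vector = parse_content_line("53 1:0.5 7:0.25")
print(doc_id, vector)
```

A dictionary is used here because tf-idf vectors are sparse; only the features that actually occur in a document are stored.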