The corpus used this year will be a subset of the Wikipedia XML Corpus of INEX 2009. This subset will be different from the one used last year. Mainly:
- Each document will belong to one or more categories
- Each document will be an XML document
- The documents will be organized in a graph in which each link corresponds to a hyperlink (or wiki link) between two documents
The proposed corpus is therefore a graph of XML documents.
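As an illustration of this structure, the sketch below shows one possible in-memory representation of such a graph. The class name, field names, document ids and XML snippets are all hypothetical; the actual corpus format and identifiers are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One node of the corpus graph (all field names are illustrative)."""
    doc_id: str
    xml: str                                       # raw XML content of the document
    categories: set = field(default_factory=set)   # one or more labels; empty if hidden
    links: set = field(default_factory=set)        # ids of the documents this one links to


def build_graph(docs, hyperlinks):
    """Index documents by id and attach each (source, target) hyperlink."""
    graph = {d.doc_id: d for d in docs}
    for source, target in hyperlinks:
        # Assumption: links pointing outside the subset are simply ignored.
        if source in graph and target in graph:
            graph[source].links.add(target)
    return graph


docs = [
    Document("d1", "<article><name>A</name></article>", {"cat1"}),
    Document("d2", "<article><name>B</name></article>", {"cat1", "cat2"}),
    Document("d3", "<article><name>C</name></article>"),  # category hidden
]
graph = build_graph(docs, [("d1", "d2"), ("d2", "d3"), ("d3", "d9")])
```

Here `d2` carries two categories, `d3` has no visible category, and the link to the unknown id `d9` is dropped.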
Semi-supervised classification
In this track, the goal is to classify each node of a graph (a node corresponds to a document), given a set of already labelled nodes (the training documents). From a machine learning point of view, the track proposed here is a transductive (or semi-supervised) classification task.
The following figure gives an example of the classification task.
|Training set: The training set is composed of XML documents organized in a graph. The red nodes correspond to documents in category 1, and the blue nodes correspond to documents in category 2. The white nodes correspond to documents whose category is hidden. The goal of the categorization task is to find the categories of the white nodes.|
|The goal of a categorization model is to find the color of the unlabelled nodes of the training graph.|
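To make the transductive setting concrete, here is a minimal baseline sketch that colors each unlabelled node with the majority category of its already labelled neighbours, repeating until nothing changes. This is not a method prescribed by the track. It assumes a single category per node and undirected links for simplicity (the corpus itself allows several categories per document), it ignores the XML content, and the function and variable names are hypothetical.

```python
from collections import Counter


def propagate_labels(edges, known_labels, max_rounds=10):
    """Give each unlabelled node the majority category among its labelled neighbours.

    edges: iterable of (node, node) pairs, treated as undirected here.
    known_labels: dict mapping node -> category for the training nodes.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    labels = dict(known_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in sorted(neighbours):
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            if votes:
                # Ties are broken by category name so the result is deterministic.
                updates[node] = min(votes, key=lambda c: (-votes[c], c))
        if not updates:
            break
        labels.update(updates)
    return labels


# A chain of five documents: "a" is red, "e" is blue, the rest are white.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
known = {"a": "red", "e": "blue"}
predicted = propagate_labels(edges, known)
```

The training labels are never overwritten; only the white nodes receive a predicted color. A realistic system would also use the XML content of each document.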
The evaluation measures for categorization will be ROC curves and the F1 measure.
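As a reminder of how the F1 measure works when documents can carry several categories, the sketch below computes it for one category as the harmonic mean of precision and recall. How scores are averaged across categories is not specified here, so only the per-category value is shown; the function name and the toy data are hypothetical.

```python
def f1_for_category(true_labels, predicted_labels, category):
    """F1 of one category, given dicts mapping document id -> set of categories."""
    tp = fp = fn = 0
    for doc_id, truth in true_labels.items():
        predicted = predicted_labels.get(doc_id, set())
        if category in predicted and category in truth:
            tp += 1          # correctly assigned
        elif category in predicted:
            fp += 1          # assigned but wrong
        elif category in truth:
            fn += 1          # missed
    if tp == 0:
        return 0.0           # convention used here when nothing is correctly assigned
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = {"d1": {"cat1"}, "d2": {"cat1", "cat2"}, "d3": {"cat2"}, "d4": {"cat1"}}
guess = {"d1": {"cat1"}, "d2": {"cat2"}, "d3": {"cat1", "cat2"}, "d4": {"cat1"}}
score = f1_for_category(truth, guess, "cat1")
```

For `cat1` in this toy data there are two true positives, one false positive and one false negative, so precision and recall are both 2/3.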