At the C-STAR consortium meeting held in Trento in December 2002,
the decision was taken to organize, on a regular basis, evaluation
campaigns and workshops focusing on speech translation research and
evaluation.
Activities within C-STAR will also include the
development of a large multilingual parallel corpus to be used for common
evaluations.
Evaluation Campaign 2003
The first evaluation campaign and workshop will take place in May 2003 and
September 2003, respectively.
This year, both events will be restricted to C-STAR members,
and the evaluation will be limited to written texts.
In particular, training and
test data will be based on the BTEC corpus developed by ATR and extended
by the partners to their respective languages.
Specifications
– The first evaluation campaign will concentrate on assessing text translation algorithms in the tourism domain. Translation directions will be from Chinese, Italian, Japanese, and Korean into English for the primary condition, and any
other direction for the secondary condition.
– Training data will consist of a fixed number of English sentences provided
with translations into the respective source language. Participants will be allowed
to use any additional monolingual resources, e.g. text corpora, grammars, word lists,
and segmentation tools.
– Test data for the primary condition will consist of English sentences taken from phrase-books not included in the training data. Test data for the
secondary condition will consist of manual translations of the English sentences
into all the considered source languages.
– The primary condition will be mandatory for all participants. Participants will
be invited to submit multiple runs for each condition, possibly corresponding to
different translation directions.
Evaluation Protocol
– Automatic scoring will be carried out with the NIST/BLEU software. In particular, a
server will be set up that will permit participants to remotely score the output of their systems. For each translation direction, multiple translations will be used as references (a minimal scoring sketch is given after this list).
– Subjective evaluation of the primary condition will be distributed across the participating sites. Native English speakers will evaluate the output of each system against one gold-standard reference. The evaluation will follow guidelines similar to those applied by the LDC
in the NIST MT evaluation campaigns.
– While automatic evaluation will be applied to all submitted runs, subjective evaluation will
be applied to only one run per participant, namely the first run submitted under the primary
condition.
– Finally, participants are allowed to discuss their results without restriction. Disclosure of the
results of other participants is not allowed without their permission.
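
For illustration only, the sketch below shows how multi-reference BLEU scoring of a submitted run might look. It uses NLTK's corpus_bleu rather than the official NIST/BLEU scoring tool that the server will run, and all sentences and tokenization are invented for this example; the official script differs in tokenization and brevity-penalty details.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Hypothetical system output: one tokenized hypothesis per test sentence.
    hypotheses = [
        ["the", "hotel", "is", "near", "the", "station"],
        ["how", "much", "is", "this", "postcard"],
    ]

    # Multiple reference translations per sentence, as the campaign provides
    # for each translation direction (sentences invented for illustration).
    references = [
        [["the", "hotel", "is", "near", "the", "station"],
         ["the", "hotel", "is", "close", "to", "the", "station"]],
        [["how", "much", "does", "this", "postcard", "cost"],
         ["how", "much", "is", "this", "postcard"]],
    ]

    # Corpus-level BLEU; smoothing avoids zero scores on short segments.
    score = corpus_bleu(references, hypotheses,
                        smoothing_function=SmoothingFunction().method1)
    print("BLEU: %.4f" % score)

Using several references per sentence, as above, rewards legitimate translation variation instead of penalizing systems that deviate from a single reference wording.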