Introduction
I am a PhD candidate in the computational linguistics group at the University of Groningen. My research focuses on text generation from abstract dependency structures, under the supervision of dr. Gertjan van Noord. This research is conducted as part of the STEVIN DAISY project. Our task within the DAISY project is to build a sentence generator for Dutch.
Overview
The sentence generator builds on the grammar and lexicon of the Alpino parser for Dutch and consists of two components:
- A sentence realizer that transforms an abstract representation into sentences that conform to that representation.
- A fluency ranker that tries to choose the most fluent sentence from all sentences produced by the realizer.
Realization
Our sentence realizer generates sentences from abstract dependency structures. Abstract dependency structures are attribute-value structures that describe grammatical dependencies between words. Generation is performed in three steps:
- Based on lexical information in the dependency structure, a bag of words and their morphological information is assembled.
- Chart generation is performed using the grammar and the bag of words derived in the first step. All resulting derivation trees are packed into a so-called forest.
- The most fluent sentences are unpacked from the forest. Their fluency is determined by the fluency ranker.
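The first of these steps can be illustrated with a minimal sketch. It assumes a dependency structure encoded as nested Python dictionaries with hypothetical attribute names (`rel`, `root`, `pos`, `children`); this is not Alpino's actual representation, only an illustration of how lexical information might be collected into a bag of words.

```python
from collections import Counter

# Hypothetical encoding of an abstract dependency structure for
# "de kat slaapt": nested attribute-value dictionaries. The attribute
# names are assumptions made for this example.
structure = {
    "rel": "top",
    "children": [
        {
            "rel": "su",
            "children": [
                {"rel": "det", "root": "de", "pos": "det"},
                {"rel": "hd", "root": "kat", "pos": "noun"},
            ],
        },
        {"rel": "hd", "root": "slaap", "pos": "verb"},
    ],
}


def bag_of_words(node):
    """Collect (root, pos) pairs from all lexical nodes of the structure."""
    bag = Counter()
    if "root" in node:
        # A lexical node: record its root form and part of speech.
        bag[(node["root"], node.get("pos"))] += 1
    for child in node.get("children", []):
        # Merge the counts gathered from each dependent.
        bag.update(bag_of_words(child))
    return bag


bag = bag_of_words(structure)
```

A multiset (`Counter`) is used rather than a set so that a word occurring twice in the structure is also available twice during generation.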
To make sentence realization practical in terms of speed, we restrict realization to partial derivation trees that can potentially realize the initial dependency structure. In our system, such top-down guidance is achieved by injecting information about expected dependency relations into the feature structure of each lexical item.
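The idea of top-down guidance can be sketched roughly as follows. This is a simplified illustration, not the mechanism used in the actual system: it assumes the same hypothetical dictionary encoding (`rel`, `root`, `children`), records for each head word the relations its dependents are expected to bear, and uses that record to reject combinations the input structure cannot license. The function names are invented for the example.

```python
# Hypothetical dependency structure for "de kat slaapt".
structure = {
    "rel": "top",
    "children": [
        {
            "rel": "su",
            "children": [
                {"rel": "det", "root": "de"},
                {"rel": "hd", "root": "kat"},
            ],
        },
        {"rel": "hd", "root": "slaap"},
    ],
}


def expected_relations(node):
    """Map each head word to the relations expected of its dependents."""
    expected = {}
    children = node.get("children", [])
    for head in children:
        if head.get("rel") == "hd" and "root" in head:
            # The sisters of a head are its dependents in this encoding.
            expected[head["root"]] = {
                c["rel"] for c in children if c is not head
            }
    for child in children:
        expected.update(expected_relations(child))
    return expected


def may_combine(head_root, relation, expected):
    """Allow a combination only if the input structure expects it."""
    return relation in expected.get(head_root, set())


expected = expected_relations(structure)
```

With this table, a generator could discard a partial derivation that attaches, say, a direct object to "slaap", because `may_combine("slaap", "obj1", expected)` is false for this input.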
Fluency ranking
The fluency ranker attempts to pick the most fluent sentence from all sentences produced by the realizer. It uses a model that calculates a score for each realization, based on features of the realization, such as:
- The probability of the realization according to an n-gram language model.
- Topicalization of subjects/non-subjects.
- Orderings in the middle field.
- Characteristics of the derivation tree.
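As a rough illustration of how such a ranker could score realizations, the sketch below uses a linear model: each realization is described by a feature dictionary and scored as the weighted sum of its feature values. The text above says only that a model scores realizations from their features, so the linear form, the feature names, the weights, and the feature values are all assumptions made for the example.

```python
def score(features, weights):
    """Weighted sum of feature values; unknown features contribute 0."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def most_fluent(realizations, weights):
    """Return the realization with the highest model score."""
    return max(realizations, key=lambda r: score(r["features"], weights))


# Invented weights and feature values, for illustration only.
weights = {"ngram_logprob": 1.0, "topicalized_non_subject": -0.5}

realizations = [
    {
        "sentence": "de kat slaapt vandaag",
        "features": {"ngram_logprob": -8.0, "topicalized_non_subject": 0.0},
    },
    {
        "sentence": "vandaag slaapt de kat",
        "features": {"ngram_logprob": -8.2, "topicalized_non_subject": 1.0},
    },
]

best = most_fluent(realizations, weights)
```

Under these made-up weights, the subject-initial variant scores higher because it has both the better language-model score and no penalty for topicalizing a non-subject.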
Because features are extracted automatically, the resulting models can contain a very large number of features. Feature selection reduces the feature set to the most effective features.
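The selection method is not described here. As a deliberately simple stand-in, the sketch below keeps the k features with the largest absolute weight in an already trained linear model. This criterion, the function name `select_top_k`, and the example weights are assumptions for illustration and should not be read as the method used in this research.

```python
def select_top_k(weights, k):
    """Keep the k features with the largest absolute weight."""
    ranked = sorted(weights.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return dict(ranked[:k])


# Invented weights: two influential features and two near-zero ones.
weights = {
    "ngram_logprob": 1.0,
    "topicalized_non_subject": -0.5,
    "rule_np_det_n": 0.01,
    "rule_vp_v": -0.002,
}

reduced = select_top_k(weights, 2)
```

Ranking by absolute value treats strongly negative weights as equally informative as strongly positive ones, which matters for penalty features such as the topicalization example.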
Study
Contact information
Daniël de Kok
Alfa-informatica
University of Groningen
Harmoniecomplex building 1311, room 435
Oude Kijk in ’t Jatstraat 26
9712 EK Groningen
The Netherlands
E-Mail: d.j.a.de.kok%rug.nl