Feature selection for maximum entropy modeling

Introduction

featuresqueeze performs maximum entropy-based feature selection. Selection proceeds by tentatively adding candidate features one at a time and estimating a weight for each; the feature/weight pair that provides the highest gain is then added to the model (Berger et al., 1996). To make this computationally feasible, we assume that the weight of a feature stays constant once it has been added to the model.
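The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the featuresqueeze implementation: the data layout (a list of contexts, each a list of (count, features) events), the function names, and the use of Newton's method for the single-weight estimate are all assumptions made for the example. The two thresholds mirror the defaults of the -a and -g options.

```python
import math

def log_likelihood(contexts, weights):
    """Conditional log-likelihood of the observed event counts."""
    total = 0.0
    for events in contexts:
        scores = [sum(weights.get(f, 0.0) * v for f, v in feats.items())
                  for _, feats in events]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += sum(count * (s - log_z)
                     for (count, _), s in zip(events, scores))
    return total

def estimate_weight(contexts, weights, feature, tol=1e-6, max_iter=100):
    """Newton's method on the weight (alpha) of one candidate feature.
    Weights already in the model are held constant."""
    alpha = 0.0
    for _ in range(max_iter):
        grad = hess = 0.0
        for events in contexts:
            n = sum(count for count, _ in events)
            vals = [feats.get(feature, 0.0) for _, feats in events]
            scores = [math.exp(sum(weights.get(f, 0.0) * v
                                   for f, v in feats.items()) + alpha * val)
                      for (_, feats), val in zip(events, vals)]
            z = sum(scores)
            mean = sum(s / z * v for s, v in zip(scores, vals))
            var = sum(s / z * v * v for s, v in zip(scores, vals)) - mean ** 2
            # Observed minus expected feature value; curvature is -variance.
            grad += sum(c * v for (c, _), v in zip(events, vals)) - n * mean
            hess -= n * var
        if abs(hess) < 1e-12:
            break
        step = grad / hess
        alpha -= step
        if abs(step) < tol:  # alpha convergence threshold
            break
    return alpha

def select_features(contexts, max_features, gain_threshold=1e-6):
    """Greedily add the feature with the highest log-likelihood gain."""
    candidates = sorted({f for events in contexts
                         for _, feats in events for f in feats})
    weights, selected = {}, []
    while candidates and len(selected) < max_features:
        base = log_likelihood(contexts, weights)
        best_gain, best_feature, best_alpha = -math.inf, None, 0.0
        for f in candidates:
            alpha = estimate_weight(contexts, weights, f)
            gain = log_likelihood(contexts, {**weights, f: alpha}) - base
            if gain > best_gain:
                best_gain, best_feature, best_alpha = gain, f, alpha
        if best_gain < gain_threshold:
            break
        weights[best_feature] = best_alpha  # never re-estimated afterwards
        selected.append(best_feature)
        candidates.remove(best_feature)
    return selected, weights

# Hypothetical toy data: two contexts with two events each.
contexts = [
    [(3, {"a": 1.0, "b": 1.0}), (1, {"c": 1.0})],
    [(1, {"a": 1.0}), (2, {"b": 1.0, "c": 1.0})],
]
selected, weights = select_features(contexts, max_features=2)
print(selected, weights)
```

Every round re-estimates a weight for every remaining candidate, which is what makes the plain method expensive on large feature sets.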

featuresqueeze also implements a fast feature selection method, which assumes that the gain of a feature rarely increases when another feature is added to the model (Zhou et al., 2003).
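One common way to exploit this assumption is lazy re-evaluation with a priority queue: gains computed for an earlier model are treated as upper bounds, and only the feature at the top of the queue is recomputed. The Python sketch below shows that idea under stated assumptions. The function lazy_select, the gain_fn callback and the toy coverage-based gain are hypothetical stand-ins, not the maxent gain or the code used by featuresqueeze.

```python
import heapq

def lazy_select(candidates, gain_fn, max_features, gain_threshold=1e-6):
    """Select features greedily, recomputing a gain only when the feature
    reaches the top of the queue with an outdated value."""
    selected = []
    # heapq is a min-heap, so gains are negated. The third field records
    # the model size at which the gain was computed.
    heap = [(-gain_fn(f, selected), f, 0) for f in candidates]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < max_features:
        neg_gain, feature, stamp = heapq.heappop(heap)
        if stamp != len(selected):
            # Stale gain: recompute against the current model and requeue.
            heapq.heappush(heap, (-gain_fn(feature, selected), feature,
                                  len(selected)))
            evaluations += 1
            continue
        if -neg_gain < gain_threshold:
            break  # the best up-to-date gain is too small; stop
        selected.append(feature)
    return selected, evaluations

# Toy gain that never increases as the model grows: the number of items
# a feature covers that no selected feature covers yet.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}

def coverage_gain(feature, selected):
    covered = set()
    for s in selected:
        covered |= covers[s]
    return len(covers[feature] - covered)

selected, evaluations = lazy_select(sorted(covers), coverage_gain,
                                    max_features=10)
print(selected, evaluations)
```

If the assumption is violated, so that a gain does increase after another feature is added, this scheme can miss the truly best feature in a round. That is the trade-off for skipping most gain computations.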

This program has been developed for compacting models for the Alpino fluency ranker for Dutch. Comments/suggestions/questions can be sent to: Daniël de Kok <>.

Availability

featuresqueeze can be retrieved from GitHub. You can download the latest development version with:

git clone git://github.com/danieldk/featuresqueeze.git

GitHub also offers the possibility to download the sources as an archive.

Compilation

Requirements:

  • C++ standard library with TR1 extensions.
  • The Eigen C++ template library for linear algebra.

g++ 4.2.x satisfies these requirements. With these components in place, compile by simply executing ‘make’ in the top-level directory.

Usage

Feature selection can be performed using the ‘fsqueeze’ command:

fsqueeze [OPTION] dataset

  -a val   Alpha convergence threshold (default: 1e-6)
  -c       Correlation selection
  -f       Fast maxent selection (do not recalculate all gains)
  -g val   Gain threshold (default: 1e-6)
  -n val   Maximum number of features
  -o       Find overlap (incompatible with -f)
  -r val   Correlation exclusion threshold (default: 0.9)

Here, ‘dataset’ is a data set in TADM format, without the optional header line.
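The exact behaviour of the -c and -r options is not documented here. As a rough illustration of what a correlation exclusion threshold can mean, the Python sketch below drops candidates whose absolute Pearson correlation with an already selected feature exceeds the threshold. This is one plausible reading, not a description of what fsqueeze does: the function names, the column-per-feature data layout and the use of Pearson correlation are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sd_x * sd_y)

def exclude_correlated(columns, selected, threshold=0.9):
    """Keep candidates whose |r| with every selected feature is at most
    the threshold (hypothetical helper, default mirrors -r)."""
    return [name for name in columns
            if name not in selected
            and all(abs(pearson(columns[name], columns[s])) <= threshold
                    for s in selected)]

# Hypothetical feature values, one list per feature, one entry per event.
columns = {
    "len": [1.0, 2.0, 3.0, 4.0],
    "len2": [2.0, 4.0, 6.0, 8.5],   # nearly a multiple of "len"
    "flag": [1.0, 0.0, 1.0, 0.0],
}
remaining = exclude_correlated(columns, selected=["len"])
print(remaining)
```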

To do

  • Test with more datasets. The program currently fails on some datasets that are very different from mine.

License

This application is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This license is included in the file COPYING in the top-level directory.

References

  1. A maximum entropy approach to natural language processing, Adam L. Berger, Vincent J. Della Pietra, Stephen A. Della Pietra, Computational Linguistics, Volume 22, Issue 1, March 1996
  2. A fast algorithm for feature selection in conditional maximum entropy modeling, Yaqian Zhou, Fuliang Weng, Lide Wu, Hauke Schmidt, Theoretical Issues In Natural Language Processing, Proceedings of the 2003 conference on Empirical methods in natural language processing, Volume 10