Feature selection for maximum entropy modeling
Introduction
featuresqueeze performs maximum entropy-based feature selection. Features are selected one at a time: in each round, a weight is estimated for every candidate feature, and the feature whose weight provides the highest gain is added to the model (Berger et al., 1996). To make this computationally feasible, we assume that the weight of a feature stays constant once it has been added to the model.
featuresqueeze also implements a fast feature selection method, which assumes that the gain of a feature rarely increases when another feature is added to the model (Zhou et al., 2003).
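The selection loop described above can be sketched in Python as follows. This is an illustrative sketch, not the featuresqueeze implementation: the function names, the data layout (a context is a list of (frequency, feature dictionary) events), and the use of Newton's method to estimate the weight are all assumptions made for the example. The gain is taken to be the increase in log-likelihood when one feature is added while the weights already in the model are held fixed.

```python
import math

def model_probs(context, weights):
    """Return p(y|x) for each event of one context under the current weights."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for _, feats in context]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gain_terms(contexts, probs, feature, alpha, total):
    """Gain of adding `feature` with weight alpha, and its first two derivatives."""
    gain = grad = hess = 0.0
    for context, p in zip(contexts, probs):
        p_x = sum(freq for freq, _ in context) / total  # empirical p(x)
        vals = [feats.get(feature, 0.0) for _, feats in context]
        emp = sum(freq * v for (freq, _), v in zip(context, vals)) / total
        w = [pi * math.exp(alpha * v) for pi, v in zip(p, vals)]
        z = sum(w)
        mean = sum(wi * v for wi, v in zip(w, vals)) / z
        var = sum(wi * v * v for wi, v in zip(w, vals)) / z - mean * mean
        gain += alpha * emp - p_x * math.log(z)
        grad += emp - p_x * mean
        hess -= p_x * var
    return gain, grad, hess

def best_gain(contexts, weights, feature, alpha_tol=1e-6, max_iter=50):
    """Newton search for the weight of `feature` that maximizes the gain."""
    total = sum(freq for context in contexts for freq, _ in context)
    probs = [model_probs(context, weights) for context in contexts]
    alpha = 0.0
    for _ in range(max_iter):
        _, grad, hess = gain_terms(contexts, probs, feature, alpha, total)
        if hess > -1e-12:  # the feature does not vary within any context
            break
        step = grad / hess
        alpha -= step
        if abs(step) < alpha_tol:
            break
    gain, _, _ = gain_terms(contexts, probs, feature, alpha, total)
    return gain, alpha

def select(contexts, candidates, max_features, gain_threshold=1e-6):
    """Greedy selection: add the highest-gain feature until no gain remains."""
    weights, selected = {}, []
    remaining = set(candidates)
    while remaining and len(selected) < max_features:
        scored = [(best_gain(contexts, weights, f), f) for f in sorted(remaining)]
        (gain, alpha), feature = max(scored, key=lambda t: t[0][0])
        if gain < gain_threshold:
            break
        weights[feature] = alpha  # kept constant in all later rounds
        remaining.discard(feature)
        selected.append((feature, alpha, gain))
    return selected

# Toy data (hypothetical): two contexts with two events each.
contexts = [
    [(3, {"a": 1.0}), (1, {"b": 1.0})],
    [(1, {"a": 1.0, "b": 1.0}), (1, {"b": 1.0})],
]
selected = select(contexts, ["a", "b"], max_features=2)
for feature, alpha, gain in selected:
    print(feature, round(alpha, 4), round(gain, 4))
```

On this toy data, feature b has the larger gain and is selected first. Once b is in the model, the model already matches the empirical expectation of a, so the gain of a falls below the threshold and selection stops.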
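One way to exploit this assumption is lazy re-evaluation with a priority queue, sketched below. It is a sketch of the general idea only and may differ from what featuresqueeze or Zhou et al. (2003) do in detail; `lazy_select`, `toy_gain`, and the toy numbers are hypothetical names and values introduced for illustration. Gains are computed once for all candidates, and afterwards only the candidate at the top of the queue is refreshed. If its refreshed gain still beats the (possibly stale) gain of the runner-up, it is selected without recalculating any other gain.

```python
import heapq

def lazy_select(candidates, gain_fn, max_features, gain_threshold=1e-6):
    """Select features, refreshing only the gain of the top-ranked candidate."""
    selected = []
    # Initial gains against the empty model. heapq is a min-heap, so negate.
    heap = [(-gain_fn(f, selected), f) for f in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < max_features:
        _, feature = heapq.heappop(heap)
        gain = gain_fn(feature, selected)  # refresh the top candidate only
        if gain < gain_threshold:
            continue  # gains are assumed not to increase, so drop it for good
        if not heap or gain >= -heap[0][0]:
            selected.append(feature)  # still at least as good as the runner-up
        else:
            heapq.heappush(heap, (-gain, feature))  # re-rank with the fresh gain
    return selected

# Toy gain function (hypothetical): f2 is largely redundant once f1 is selected.
base = {"f1": 0.9, "f2": 0.8, "f3": 0.5, "f4": 0.1}
redundant_with = {"f2": "f1"}

def toy_gain(feature, selected):
    gain = base[feature]
    if redundant_with.get(feature) in selected:
        gain *= 0.1
    return gain

print(lazy_select(["f1", "f2", "f3", "f4"], toy_gain, max_features=3))
```

The result is only guaranteed to match full recalculation when gains never increase; where that assumption fails, a feature whose gain has risen may be ranked too low.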
This program was developed to compact models for the Alpino fluency ranker for Dutch. Comments, suggestions, and questions can be sent to Daniël de Kok.
Availability
featuresqueeze can be retrieved from GitHub. You can download the latest development version with:
git clone git://github.com/danieldk/featuresqueeze.git
GitHub also offers the sources for download as an archive.
Compilation
Requirements:
- C++ standard library with TR1 extensions.
- The Eigen C++ template library for linear algebra.
g++ 4.2.x ships a standard library with the required TR1 extensions. With both components in place, compile by running ‘make’ in the top-level directory.
Usage
Feature selection can be performed using the ‘fsqueeze’ command:
fsqueeze [OPTION] dataset
  -a val  Alpha convergence threshold (default: 1e-6)
  -c      Correlation selection
  -f      Fast maxent selection (do not recalculate all gains)
  -g val  Gain threshold (default: 1e-6)
  -n val  Maximum number of features
  -o      Find overlap (incompatible with -f)
  -r val  Correlation exclusion threshold (default: 0.9)
Here, ‘dataset’ is a data set in TADM format, without the optional header line.
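For orientation, the sketch below reads such a file into memory. It assumes the usual TADM event layout, which this document does not spell out and which should be checked against the TADM documentation: each context starts with a line holding its number of events, and each event line holds a frequency, the number of feature/value pairs, and then the pairs themselves. The function name `parse_tadm` and the sample data are hypothetical.

```python
def parse_tadm(lines):
    """Parse header-less TADM-style lines into contexts of (freq, features)."""
    rows = iter(line.split() for line in lines if line.strip())
    contexts = []
    for fields in rows:
        n_events = int(fields[0])  # context line: number of events that follow
        context = []
        for _ in range(n_events):
            event = next(rows)
            freq = float(event[0])
            n_pairs = int(event[1])
            pairs = event[2:2 + 2 * n_pairs]
            # Pairs alternate between a feature index and its value.
            feats = {int(pairs[i]): float(pairs[i + 1])
                     for i in range(0, 2 * n_pairs, 2)}
            context.append((freq, feats))
        contexts.append(context)
    return contexts

# Hypothetical sample: two contexts with two events each.
sample = [
    "2",
    "3 1 0 1.0",
    "1 1 1 1.0",
    "2",
    "1 2 0 1.0 1 1.0",
    "1 1 1 1.0",
]
for context in parse_tadm(sample):
    print(context)
```

To read a real file, the open file object can be passed directly, since iterating over it yields lines.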
To do
- Test with more data sets; the program currently fails on some data sets that differ substantially from mine.
License
This application is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This license is included in the file COPYING in the top-level directory.
References
- A maximum entropy approach to natural language processing, Adam L. Berger, Vincent J. Della Pietra, Stephen A. Della Pietra, Computational Linguistics, Volume 22, Issue 1, March 1996
- A fast algorithm for feature selection in conditional maximum entropy modeling, Yaqian Zhou, Fuliang Weng, Lide Wu, Hauke Schmidt, Theoretical Issues In Natural Language Processing, Proceedings of the 2003 conference on Empirical methods in natural language processing, Volume 10