Redalyc Scientific Information System
Network of Scientific Journals of Latin America and the Caribbean, Spain and Portugal
Named Entity Recognition in Hindi using Maximum Entropy and Transliteration
Sujan Kumar Saha, Partha Sarathi Ghosh, Sudeshna Sarkar, Pabitra Mitra
Polibits 2008 38
Named entities are perhaps the most important indexing element in text for most information extraction and mining tasks. Constructing a Named Entity Recognition (NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used in the development of NER systems. In many resource-poor languages, gazetteer lists of adequate size are not available, but relevant lists are sometimes available in English. Proper transliteration makes the English lists useful in NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi. We explore different features applicable to the Hindi NER task. We incorporate some gazetteer lists into the system to increase its performance. These lists are collected from the web and are in English. To make these English lists useful in the Hindi NER task, we propose a two-phase transliteration methodology. A considerable performance improvement is observed after using the transliteration-based gazetteer lists in the system. The proposed transliteration-based gazetteer preparation methodology is also applicable to other languages. Apart from Hindi, we have applied the transliteration approach to the Bengali NER task and also achieved a performance improvement.
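
The gazetteer idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the two phases, so everything below is an assumption made for illustration: the shared "consonant skeleton" key, the tiny `DEVANAGARI_TO_LATIN` table, and the function names are all hypothetical and are not the authors' method. The sketch only shows how an English name list and a Hindi token could be mapped into a common key space so that list membership can serve as a binary feature for a classifier such as a Maximum Entropy model.

```python
# Hypothetical, deliberately tiny Devanagari-to-Latin table (assumption);
# a real system would need to cover the full script.
DEVANAGARI_TO_LATIN = {
    "क": "k", "म": "m", "र": "r", "ल": "l", "द": "d",
    "न": "n", "स": "s", "त": "t",
    "ा": "a", "ि": "i", "ी": "i", "ु": "u", "ू": "u", "े": "e", "ो": "o",
    "अ": "a", "आ": "a", "इ": "i",
}

VOWELS = set("aeiou")


def collapse_repeats(chars):
    """Drop immediately repeated characters, e.g. ['l', 'l'] -> 'l'."""
    out = []
    for c in chars:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def latin_key(word):
    """Assumed phase 1: reduce a Latin-script name to a consonant skeleton."""
    consonants = [c for c in word.lower() if c.isalpha() and c not in VOWELS]
    return collapse_repeats(consonants)


def hindi_key(word):
    """Assumed phase 2: romanize a Hindi word with the toy table, then
    reduce it to the same consonant skeleton. Unmapped characters are dropped."""
    romanized = "".join(DEVANAGARI_TO_LATIN.get(ch, "") for ch in word)
    return latin_key(romanized)


def build_gazetteer_index(english_names):
    """Precompute intermediate keys for an English gazetteer list."""
    return {latin_key(name) for name in english_names} - {""}


def in_gazetteer(hindi_word, index):
    """Binary gazetteer feature: does the Hindi token match any list entry?"""
    key = hindi_key(hindi_word)
    return bool(key) and key in index


index = build_gazetteer_index(["Kamal", "Sita", "Ram"])
print(in_gazetteer("कमल", index))   # Hindi spelling of "Kamal"
print(in_gazetteer("नदी", index))   # common noun "nadi" (river), not a listed name
```

A key this coarse will produce false matches on real data; it is meant only to make the matching step concrete, not to suggest a usable transliteration scheme.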

Keywords: Gazetteer list preparation, named entity recognition, natural language processing, transliteration.
Universidad Autónoma del Estado de México