nlp - Multi-term named entities in Stanford Named Entity Recognizer


I'm using the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) and it's working fine. This is my code:

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());
            }
        }
    }

However, the problem I'm running into is identifying first and last names together. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as a single term.

Could this be achieved through the recognizer itself, maybe through a configuration option? I haven't found anything in the javadoc so far.

Thanks!

Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    import org.apache.log4j.Logger;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    /**
     * Created by Chanuka on 8/28/14 AD.
     */
    public class FindNameEntityTypeExecutor {

        private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);

        private StanfordCoreNLP pipeline;

        public FindNameEntityTypeExecutor() {
            logger.info("Initializing Annotator pipeline ...");
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            pipeline = new StanfordCoreNLP(props);
            logger.info("Annotator pipeline initialized");
        }

        List<String> findNameEntityType(String text, String entity) {
            logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);

            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);

            // run all Annotators on this text
            pipeline.annotate(document);
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
            List<String> matches = new ArrayList<String>();

            for (CoreMap sentence : sentences) {
                int previousCount = 0;
                int count = 0;

                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    int previousWordIndex;
                    if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                        count++;
                        if (previousCount != 0 && (previousCount + 1) == count) {
                            // the previous token had the same entity type: append to the last match
                            previousWordIndex = matches.size() - 1;
                            String previousWord = matches.get(previousWordIndex);
                            matches.remove(previousWordIndex);
                            previousWord = previousWord.concat(" " + word);
                            matches.add(previousWordIndex, previousWord);
                        } else {
                            matches.add(word);
                        }
                        previousCount = count;
                    } else {
                        count = 0;
                        previousCount = 0;
                    }
                }
            }
            return matches;
        }
    }

Code for the classifyToCharacterOffsets approach:

    List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);
    for (Triple<String, Integer, Integer> triple : result) {
        System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
    }

This is because your inner for loop iterates over individual tokens (words) and adds them separately. You need to change things so that whole names are added at once.

One way is to replace the inner for loop with a regular for loop that has a while loop inside it, which takes adjacent non-O things of the same class and adds them as a single entity.*

Another way would be to use the CRFClassifier method call:

List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)

which will give you whole entities, from which you can extract the String form by using substring on the original input.

* The models we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels, such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and the feature factories support such labels, but they are not used in the models we currently distribute (as of 2012).
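As a rough illustration of the footnote, the sketch below merges adjacent tokens under the plain IO scheme and, for contrast, decodes IOB-style tags. The `LabelSchemes` and `Tagged` classes, the tag strings and the sample tokens are hypothetical stand-ins supplied by hand, not Stanford NER types. The point is only that IO merging cannot separate two same-label entities that touch, whereas a B- prefix can.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LabelSchemes {
    /** Hypothetical stand-in for a tagged token; not a Stanford NER class. */
    static final class Tagged {
        final String word;
        final String tag;
        Tagged(String word, String tag) { this.word = word; this.tag = tag; }
    }

    /** IO scheme: merge every run of adjacent tokens that share a non-O tag. */
    static List<String> mergeIO(List<Tagged> tokens) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prevTag = "O";
        for (Tagged t : tokens) {
            if (!t.tag.equals(prevTag)) {
                // The tag changed: flush the pending entity, if any.
                if (!prevTag.equals("O")) entities.add(prevTag + ":" + current);
                current.setLength(0);
            }
            if (!t.tag.equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(t.word);
            }
            prevTag = t.tag;
        }
        if (!prevTag.equals("O")) entities.add(prevTag + ":" + current);
        return entities;
    }

    /** IOB scheme: B-X always starts a new entity, I-X continues an entity of type X.
     *  Assumes every tag is "O" or has a two-character "B-"/"I-" prefix. */
    static List<String> mergeIOB(List<Tagged> tokens) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = null;
        for (Tagged t : tokens) {
            boolean continues = t.tag.startsWith("I-")
                    && t.tag.substring(2).equals(currentType);
            if (!continues && currentType != null) {
                entities.add(currentType + ":" + current);
                currentType = null;
            }
            if (continues) {
                current.append(' ').append(t.word);
            } else if (!t.tag.equals("O")) {
                currentType = t.tag.substring(2);
                current.setLength(0);
                current.append(t.word);
            }
        }
        if (currentType != null) entities.add(currentType + ":" + current);
        return entities;
    }

    public static void main(String[] args) {
        // Two people listed back to back, with no O token between them.
        List<Tagged> io = Arrays.asList(
                new Tagged("Joe", "PERSON"), new Tagged("Smith", "PERSON"),
                new Tagged("Ann", "PERSON"));
        List<Tagged> iob = Arrays.asList(
                new Tagged("Joe", "B-PERS"), new Tagged("Smith", "I-PERS"),
                new Tagged("Ann", "B-PERS"));
        System.out.println(mergeIO(io));   // one merged entity
        System.out.println(mergeIOB(iob)); // two separate entities
    }
}
```

With IO labels the three tokens collapse into a single PERSON entity, which is the expected trade-off of the simpler scheme.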

Make use of the classifiers that are already provided. I believe this is what you are looking for:

    private static String combineNERSequence(String text) {
        String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier = null;
        try {
            classifier = CRFClassifier.getClassifier(serializedClassifier);
        } catch (ClassCastException | ClassNotFoundException | IOException e) {
            e.printStackTrace();
        }
        if (classifier == null) {
            // The model could not be loaded, so there is nothing to classify with.
            return null;
        }
        // Classify once and reuse the result for printing and returning.
        String xml = classifier.classifyWithInlineXML(text);
        System.out.println(xml);
        // FOR TSV FORMAT //
        //System.out.print(classifier.classifyToString(text, "tsv", false));
        return xml;
    }

The drawback of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.

As proposed by Christopher, here is an example of a loop that assembles "adjacent non-O things". This example also counts the number of occurrences.

    public HashMap<String, HashMap<String, Integer>> extractEntities(String text) {
        HashMap<String, HashMap<String, Integer>> entities =
                new HashMap<String, HashMap<String, Integer>>();

        for (List<CoreLabel> lcl : classifier.classify(text)) {
            int i = 0;
            while (i < lcl.size()) {
                String answer = lcl.get(i).getString(CoreAnnotations.AnswerAnnotation.class);
                if (answer.equals("O")) {
                    i++;
                    continue;
                }

                // Collect the run of adjacent tokens that share this label.
                String value = lcl.get(i).getString(CoreAnnotations.ValueAnnotation.class);
                i++;
                while (i < lcl.size()
                        && answer.equals(lcl.get(i).getString(CoreAnnotations.AnswerAnnotation.class))) {
                    value = value + " " + lcl.get(i).getString(CoreAnnotations.ValueAnnotation.class);
                    i++;
                }

                // Count the entity, including one that ends the sentence.
                if (!entities.containsKey(answer))
                    entities.put(answer, new HashMap<String, Integer>());
                HashMap<String, Integer> counts = entities.get(answer);
                if (!counts.containsKey(value))
                    counts.put(value, 0);
                counts.put(value, counts.get(value) + 1);
            }
        }

        return entities;
    }

Another approach to dealing with multi-word entities. This code combines several tokens into one if they carry the same annotation and appear in a row.

Limitation:
If the same token has two different annotations, only the last one is kept.

    private Document getEntities(String fullText) {
        Document entitiesList = new Document();
        NERClassifierCombiner nerCombClassifier = loadNERClassifiers();

        if (nerCombClassifier != null) {
            List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);

            for (List<CoreLabel> coreLabels : results) {
                String prevLabel = null;
                String prevToken = null;

                for (CoreLabel coreLabel : coreLabels) {
                    String word = coreLabel.word();
                    String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);

                    if (!"O".equals(annotation)) {
                        if (prevLabel != null && prevLabel.equals(annotation)) {
                            // Same label as the previous token: extend the current entity.
                            prevToken += " " + word;
                        } else {
                            // A different entity starts here: save the pending one first.
                            if (prevLabel != null) {
                                entitiesList.put(prevToken, prevLabel);
                            }
                            prevLabel = annotation;
                            prevToken = word;
                        }
                    } else if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }

                // Save an entity that runs to the end of the sentence.
                if (prevLabel != null) {
                    entitiesList.put(prevToken, prevLabel);
                }
            }
        }

        return entitiesList;
    }

Imports:

Document: org.bson.Document; NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
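The limitation noted above follows from storing results in a map keyed by the entity text: a second put with the same key replaces the earlier label. A minimal sketch of one alternative is shown here, using plain java.util collections instead of org.bson.Document. The `EntityCollector` class and the "Washington" sample are hypothetical, chosen only to show the difference.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class EntityCollector {
    // Entity text -> every label it was seen with, in first-seen order.
    private final Map<String, Set<String>> labelsByText = new LinkedHashMap<>();

    void add(String text, String label) {
        labelsByText.computeIfAbsent(text, k -> new LinkedHashSet<>()).add(label);
    }

    Map<String, Set<String>> asMap() {
        return labelsByText;
    }

    public static void main(String[] args) {
        // With a plain map, the second put replaces the first label.
        Map<String, String> lastWins = new LinkedHashMap<>();
        lastWins.put("Washington", "PERSON");
        lastWins.put("Washington", "LOCATION");
        System.out.println(lastWins); // {Washington=LOCATION}

        // Collecting labels in a set keeps both.
        EntityCollector collector = new EntityCollector();
        collector.add("Washington", "PERSON");
        collector.add("Washington", "LOCATION");
        System.out.println(collector.asMap()); // {Washington=[PERSON, LOCATION]}
    }
}
```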

I had the same problem, so I looked into it too. The method proposed by Christopher Manning is efficient, but the delicate point is deciding which kind of separator is appropriate. One could say that only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may also find the form "J.Zorn", so I would have to allow certain punctuation marks as well. But what about "Jack, James and Joe"? I might get 2 entities instead of 3 ("Jack James" and "Joe").

By digging a bit into the Stanford NER classes, I found a proper implementation of this idea. They use it to export entities in the form of single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersInlineXML, we have:

    private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
        final String background = flags.backgroundSymbol;
        String prevTag = background;
        for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
            IN wi = wordIter.next();
            String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
            String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
            String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
            if (!tag.equals(prevTag)) {
                if (!prevTag.equals(background) && !tag.equals(background)) {
                    out.print("</");
                    out.print(prevTag);
                    out.print('>');
                    out.print(before);
                    out.print('<');
                    out.print(tag);
                    out.print('>');
                } else if (!prevTag.equals(background)) {
                    out.print("</");
                    out.print(prevTag);
                    out.print('>');
                    out.print(before);
                } else if (!tag.equals(background)) {
                    out.print(before);
                    out.print('<');
                    out.print(tag);
                    out.print('>');
                }
            } else {
                out.print(before);
            }
            out.print(current);
            String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
            if (!tag.equals(background) && !wordIter.hasNext()) {
                out.print("</");
                out.print(tag);
                out.print('>');
                prevTag = background;
            } else {
                prevTag = tag;
            }
            out.print(afterWS);
        }
    }

They iterate over each word, checking whether it has the same class (answer) as the previous one, as explained before. For this, they take advantage of the fact that expressions considered not to be entities are flagged with the so-called backgroundSymbol (class "O"). They also use the BeforeAnnotation property, which holds the string separating the current word from the previous one. This last point solves the problem I initially raised regarding the choice of an appropriate separator.
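To make the role of the separator concrete, here is a small self-contained sketch of the same idea. It does not call Stanford NER: the `BeforeJoin` and `Tok` classes are hypothetical, the `before` field stands in for a token's BeforeAnnotation, and the tokenization and tags are supplied by hand as an assumption about what a tagger might produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BeforeJoin {
    /** Hypothetical token: the text preceding it, its own text, and its NER tag. */
    static final class Tok {
        final String before;
        final String text;
        final String tag;
        Tok(String before, String text, String tag) {
            this.before = before;
            this.text = text;
            this.tag = tag;
        }
    }

    /** Merges adjacent same-tag tokens, rejoining them with their original separator. */
    static List<String> entities(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prevTag = "O";
        for (Tok t : tokens) {
            // The tag changed and an entity was open: emit it.
            if (!t.tag.equals(prevTag) && !prevTag.equals("O")) {
                out.add(current + "/" + prevTag);
            }
            if (!t.tag.equals("O")) {
                if (t.tag.equals(prevTag)) {
                    // Continue the entity using whatever separated the two tokens.
                    current.append(t.before).append(t.text);
                } else {
                    current.setLength(0);
                    current.append(t.text);
                }
            }
            prevTag = t.tag;
        }
        if (!prevTag.equals("O")) out.add(current + "/" + prevTag);
        return out;
    }

    public static void main(String[] args) {
        // Hand-tagged tokens for: "J.Zorn met Jack, James and Joe"
        List<Tok> tokens = Arrays.asList(
                new Tok("", "J.", "PERSON"), new Tok("", "Zorn", "PERSON"),
                new Tok(" ", "met", "O"),
                new Tok(" ", "Jack", "PERSON"), new Tok("", ",", "O"),
                new Tok(" ", "James", "PERSON"), new Tok(" ", "and", "O"),
                new Tok(" ", "Joe", "PERSON"));
        System.out.println(entities(tokens));
        // [J.Zorn/PERSON, Jack/PERSON, James/PERSON, Joe/PERSON]
    }
}
```

Because the comma and "and" are themselves tokens tagged O, they break the run of PERSON tokens, so no separator whitelist is needed.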

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        String s = "";
        String prevLabel = null;
        for (CoreLabel word : sentence) {
            String label = word.get(CoreAnnotations.AnswerAnnotation.class);
            if (prevLabel == null || prevLabel.equals(label)) {
                // Same label as before: keep extending the current group.
                s = s + " " + word.word();
            } else {
                // Label changed: print the finished group unless it was background.
                if (!prevLabel.equals("O"))
                    System.out.println(s.trim() + '/' + prevLabel + ' ');
                s = " " + word.word();
            }
            prevLabel = label;
        }
        // Print a group that runs to the end of the sentence.
        if (prevLabel != null && !prevLabel.equals("O"))
            System.out.println(s.trim() + '/' + prevLabel + ' ');
    }

I just wrote a small piece of logic and it's working fine. What I did is group words with the same label if they are adjacent.