StanfordNLP: ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:696)


The accepted answer by @StanfordNLPHelp helped me solve this problem. All credit goes to them.

I am summarizing what the final code looks like to produce the desired format, in the hope that it helps someone.

First, I made a change in the rules file.

Then, in the code:

    // props and tests are declared as in the question's code
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(txt);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        // match strings and patterns case-insensitively
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

        CoreMapExpressionExtractor extractor =
                CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
        for (CoreMap sentence : sentences) {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            for (MatchedExpression phrase : matched) {
                // Print out matched text and value
                System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
            }
        }
    }
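To illustrate what the two CASE_INSENSITIVE flags do, here is a small sketch using plain java.util.regex rather than TokensRegex. The keyword lists are assumptions guessed from the phrases in the question, not the contents of the actual rules file, and CaseInsensitiveSkillDemo is a hypothetical class name.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveSkillDemo {

    // Hypothetical keyword lists, guessed from the phrases in the question.
    static final String REGEX =
            "\\b(?:areas of|computer|technical)\\s+(?:expertise|knowledge|skills|experience)\\b";

    // Returns true if the text contains a "first keyword + skill keyword" phrase.
    static boolean matches(String text, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(REGEX, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {"AREAS OF EXPERTISE", "Areas of Knowledge", "Computer Skills"};
        for (String sample : samples) {
            System.out.println(sample
                    + " | case-sensitive: " + matches(sample, false)
                    + " | case-insensitive: " + matches(sample, true));
        }
    }
}
```

With the lowercase keywords above, none of the three headings match case-sensitively, and all three match once the flag is set. This is presumably why the code sets the flags on the Env before loading the rules.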

I want to identify the following as SKILL using TokensRegexNERAnnotator from StanfordNLP.

AREAS OF EXPERTISE
Areas of Knowledge
Computer Skills
Technical Experience
Technical Skills

There are many more text sequences like the ones above.

Code -

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
    List tokens = new ArrayList<>();

    // traversing each sentence from array of sentence.
    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(txt);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        /* Next we can go over the annotated sentences and extract the annotated words,
           Using the CoreLabel Object */
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                System.out.println("annotated coreMap sentences : " + token);

                // Extracting NER tag for current token
                String ne = token.get(NamedEntityTagAnnotation.class);
                String word = token.get(CoreAnnotations.TextAnnotation.class);

                System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
                System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
                System.out.println("Named Entity : " + ne);
            }
        }
    }

My regex rules file is -

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    { ruleType: "tokens", pattern: ( $SKILL_FIRST_KEYWORD + $SKILL_KEYWORD ), result: "SKILL" }

I get an ArrayIndexOutOfBoundsException. I suspect something is wrong with my rules file. Can someone point out where I am making a mistake?

Desired output -

AREAS OF EXPERTISE - SKILL

Areas of Knowledge - SKILL

Computer Skills - SKILL

and so on.

Thanks in advance.

You should use TokensRegexAnnotator and not TokensRegexNERAnnotator.
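A minimal sketch of why the two annotators are not interchangeable, assuming the exception comes from the file format: TokensRegexNERAnnotator reads a mapping file with tab-separated columns (a token pattern, then an NER type, with optional further columns), whereas the file in the question is written in the TokensRegex rule syntax. MappingFormatCheck and its method are hypothetical helpers that use only the Java standard library. They are not part of CoreNLP, and the check is not the library's actual parsing code.

```java
public class MappingFormatCheck {

    // Returns true if the line has the tab-separated shape of a mapping entry:
    // a non-empty pattern column followed by a non-empty NER-type column.
    static boolean looksLikeMappingEntry(String line) {
        String[] columns = line.split("\t");
        return columns.length >= 2
                && !columns[0].trim().isEmpty()
                && !columns[1].trim().isEmpty();
    }

    public static void main(String[] args) {
        String[] lines = {
            // mapping-style entry: pattern <TAB> NER type
            "Technical Skills\tSKILL",
            // TokensRegex rule from the question: no tab-separated columns
            "{ ruleType: \"tokens\", pattern: ( $SKILL_FIRST_KEYWORD + $SKILL_KEYWORD ), result: \"SKILL\" }"
        };
        for (String line : lines) {
            System.out.println(looksLikeMappingEntry(line) + " <- " + line);
        }
    }
}
```

On these two sample lines, the first is accepted and the rule line from the question is rejected because it has no tab-separated second column. A reader expecting columns would plausibly fail on such a line, which would be consistent with the exception raised in readEntries.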

You should check these threads for more information:

TokensRegex rules to get correct output for Named Entities

Getting output in the desired format using TokenRegex