


N-gram generation from a sentence

I think this would do what you want:

import java.util.*;

public class Test {

    // Returns all word n-grams of size n from a space-separated string.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    // Joins words[start..end) with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.

An "on-demand" solution, implemented as an iterator:

import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    // Builds the n-gram starting at the current position, then advances by one word.
    public String next() {
        if (!hasNext())
            throw new NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
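The same on-demand idea can also be sketched with a Java 8 stream, which only builds each n-gram when the consumer asks for it. This is a minimal sketch, not part of the answer above: the class name `LazyNgrams` and its `ngrams` helper are hypothetical, and it assumes the input is split on single spaces, as in the iterator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class LazyNgrams {

    // Hypothetical helper: lazily yields the word n-grams of size n.
    static Stream<String> ngrams(int n, String str) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] words = str.split(" ");
        // IntStream.range is empty when the end is not greater than the start,
        // so a sentence shorter than n words simply yields no n-grams.
        return IntStream.range(0, words.length - n + 1)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }

    public static void main(String[] args) {
        List<String> bigrams = ngrams(2, "This is my car.").collect(Collectors.toList());
        System.out.println(bigrams); // [This is, is my, my car.]
    }
}
```

Because the stream is lazy, a caller that only needs the first few n-grams can add `limit(k)` before collecting, and the remaining n-grams are never built.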

How do I generate n-grams from a string such as:

String Input = "This is my car.";

I want to generate n-grams from this input:

Input Ngram size = 3

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

Please give me an idea of how to implement this in Java, or tell me whether a library is available for it.

I am trying to use NGramTokenizer, but it produces a sequence of character n-grams, and I want n-grams over sequences of words.
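To make the distinction in the question concrete, here is a minimal plain-Java sketch (it does not use Lucene) that contrasts character n-grams with word n-grams. The class `NgramKinds` and both helper methods are hypothetical names introduced for illustration, and n is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NgramKinds {

    // Character n-grams: every substring of exactly n characters.
    static List<String> charNgrams(String s, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    // Word n-grams: every run of n consecutive whitespace-separated words.
    static List<String> wordNgrams(String s, int n) {
        String[] words = s.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNgrams("This", 2));           // [Th, hi, is]
        System.out.println(wordNgrams("This is my car", 3)); // [This is my, is my car]
    }
}
```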


You are looking for ShingleFilter.

Update: the link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.
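In Lucene's terminology a "shingle" is a word (token) n-gram. The following plain-Java sketch only illustrates the kind of output shingling produces; it is not Lucene's API, and the class `ShingleSketch`, the method `shingles`, and its `minSize`/`maxSize` parameters are hypothetical names chosen for this example. The exact ordering and options of the real filter should be checked in the Lucene documentation for your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShingleSketch {

    // Hypothetical helper, not Lucene's API: for each start position, emits the
    // word n-grams ("shingles") of every size from minSize to maxSize that fit.
    static List<String> shingles(String text, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("need 1 <= minSize <= maxSize");
        }
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("This is my car", 2, 3));
        // [This is, This is my, is my, is my car, my car]
    }
}
```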


This code returns an array of all the word n-grams of the given length:

public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    // Fewer words than len: there are no n-grams to return.
    if (parts.length < len)
        return new String[0];
    String[] result = new String[parts.length - len + 1];
    for (int i = 0; i < parts.length - len + 1; i++) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++) {
            if (k > 0) sb.append(' ');
            sb.append(parts[i + k]);
        }
        result[i] = sb.toString();
    }
    return result;
}

For example:

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]


public static void CreateNgram(ArrayList<String> list, int cutoff) {
    try {
        NGramModel ngramModel = new NGramModel();
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();

        for (int i = 0; i < list.size(); i++) {
            String inputString = list.get(i);
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(new StringReader(inputString));
            String line;
            while ((line = lineStream.read()) != null) {
                // Tokenize on whitespace and POS-tag the tokens
                String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                perfMon.incrementCounter();

                String[] words = sample.getSentence();
                if (words.length > 0) {
                    // Add the bigrams (k = 2) and trigrams (k = 3)
                    for (int k = 2; k < 4; k++) {
                        ngramModel.add(new StringList(words), k, k);
                    }
                }
            }
        }

        // Drop n-grams that occur fewer than cutoff times
        ngramModel.cutoff(cutoff, Integer.MAX_VALUE);

        Iterator<StringList> it = ngramModel.iterator();
        while (it.hasNext()) {
            StringList strList = it.next();
            System.out.println(strList.toString());
        }
        perfMon.stopAndPrintFinalResult();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}

Here is my code for creating n-grams. In this case, n = 2 and 3. Word n-grams that occur fewer times than the cutoff value are dropped from the result set. The input is a list of sentences, which are then processed with OpenNLP tools.
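For readers without OpenNLP at hand, the count-then-cutoff idea described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class `NgramCutoff` and its methods are hypothetical, sentences are split on whitespace, and the cutoff is taken to mean "keep n-grams seen at least `cutoff` times".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NgramCutoff {

    // Counts the word n-grams of sizes minN..maxN across all sentences.
    static Map<String, Integer> count(List<String> sentences, int minN, int maxN) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            String[] words = sentence.trim().split("\\s+");
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns a copy without the n-grams seen fewer than cutoff times.
    static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>(counts);
        kept.values().removeIf(c -> c < cutoff);
        return kept;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("this is my car", "this is my house");
        System.out.println(applyCutoff(count(sentences, 2, 3), 2));
        // {this is=2, is my=2, this is my=2}
    }
}
```

A `LinkedHashMap` is used so that the surviving n-grams print in the order they were first seen, which keeps the example output predictable.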


/**
 * @param str         the sentence; should contain at least one word
 * @param maxGramSize should be at least 1
 * @return list of contiguous word n-grams, up to maxGramSize words each, from the sentence
 */
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
    // Split on runs of non-word characters (spaces, punctuation)
    List<String> sentence = Arrays.asList(str.split("\\W+"));
    List<String> ngrams = new ArrayList<String>();
    int ngramSize = 0;
    StringBuilder sb = null;

    // sentence becomes ngrams
    for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
        String word = it.next();

        // 1- add the word itself
        sb = new StringBuilder(word);
        ngrams.add(word);
        ngramSize = 1;
        it.previous();

        // 2- prepend the preceding words one at a time and add those n-grams too
        while (it.hasPrevious() && ngramSize < maxGramSize) {
            sb.insert(0, ' ');
            sb.insert(0, it.previous());
            ngrams.add(sb.toString());
            ngramSize++;
        }

        // go back to the initial position
        while (ngramSize > 0) {
            ngramSize--;
            it.next();
        }
    }
    return ngrams;
}

Call:

long startTime = System.currentTimeMillis();
List<String> ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = " + (stopTime - startTime) + " ms with ngramsize = " + ngrams.size());
System.out.println(ngrams.toString());

Output:

My time = 1 ms with ngramsize = 9
[This, is, This is, my, is my, This is my, car, my car, is my car]


import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) {
    String[] words = "This is my car.".split(" ");
    // stepSize 0, 1, 2 yields unigrams, bigrams, trigrams
    for (int n = 0; n < 3; n++) {
        List<String> list = ngrams(n, words);
        for (String ngram : list) {
            System.out.println(ngram);
        }
        System.out.println();
    }
}

// Returns the n-grams made of stepSize + 1 consecutive words.
public static List<String> ngrams(int stepSize, String[] words) {
    List<String> ngrams = new ArrayList<String>();
    for (int i = 0; i < words.length - stepSize; i++) {
        // Start from the first word so the n-gram has no leading space
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int j = i + 1; j <= i + stepSize; j++) {
            ngram.append(' ').append(words[j]);
        }
        ngrams.add(ngram.toString());
    }
    return ngrams;
}