r performance n-gram stop-words text-analysis

How to remove stopwords efficiently from a list of ngram tokens in R




This isn't really an answer, but more of a comment in reply to rawr's comment about going through all combinations of stopwords. With a longer stopword list, using something like %in% does not seem to suffer from that dimensionality problem.

library(purrr)

removetokenstst <- function(tokens, stopwords)
  map2(tokens,
       lapply(tokens, function(x) {
         unlist(lapply(strsplit(x, "_"), function(y) {
           any(y %in% stopwords)
         }))
       }),
       ~ .x[!.y])

require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, morestopwords),
               OP2_1 = removeTokensOP2(tokens1, morestopwords),
               OP2_2 = removeTokensOP2(tokens2, morestopwords),
               OP2_3 = removeTokensOP2(tokens3, morestopwords),
               Ak_3 = removetokenstst(tokens3, stopwords),
               Ak_3msw = removetokenstst(tokens3, morestopwords),
               unit = "relative")

Unit: relative
    expr       min        lq       mean     median        uq      max neval
   OP1_1   1.00000   1.00000   1.000000   1.000000  1.000000  1.00000   100
   OP2_1 278.48260 176.22273  96.462854  79.787932 76.904987 38.31767   100
   OP2_2 280.90242 181.22013  98.545148  81.407928 77.637006 64.94842   100
   OP2_3 279.43728 183.11366 114.879904  81.404236 82.614739 72.04741   100
    Ak_3  15.74301  14.83731   9.340444   7.902213  8.164234 11.27133   100
 Ak_3msw  18.57697  14.45574  12.936594   8.513725  8.997922 24.03969   100
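For reference, the core trick here is simply to split each n-gram on the underscore and test membership with %in%; a minimal sketch of that check on a single token:

stopwords <- c("is", "a", "in", "this")
ngram <- "in_this_text"
# split on "_" and flag the n-gram for removal if any component is a stopword
any(strsplit(ngram, "_")[[1]] %in% stopwords)
## [1] TRUE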

For the stopword list

morestopwords = c("a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "arent", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do", "does", "doesnt", "doing", "dont", "down", "during", "each", "few", "for", "from", "further", "had", "hadnt", "has", "hasnt", "have", "havent", "having", "he", "hed", "hell", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "isnt", "it", "its", "its", "itself", "lets", "me", "more", "most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shant", "she", "shed", "shell", "shes", "should", "shouldnt", "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then", "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasnt", "we", "wed", "well", "were", "weve", "were", "werent", "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why", "whys", "with", "wont", "would", "wouldnt", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself", "yourselves", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")
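This is essentially the standard English stopword list with apostrophes stripped, plus the single letters a to z. As a hedged sketch (assuming the tm package is available; the resulting vector may differ slightly from the one above), roughly the same list could be built programmatically:

library(tm)
# approximate reconstruction: English stopwords with apostrophes removed, plus single letters
morestopwords <- unique(c(gsub("'", "", stopwords("en")), letters))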

What follows is a request for a better way to do something I can already do inefficiently: filter a series of n-gram tokens using "stop words", so that the occurrence of any stop-word term in an n-gram triggers its removal.

I would very much like to have one solution that works for both unigrams and n-grams, although it would be fine to have two versions, one with a "fixed" flag and one with a "regex" flag. I am putting the two aspects of the question together, since someone may have a solution that tries a different approach addressing both fixed and regular-expression stop-word patterns.

Formats:

  • tokens is a list of character vectors, which may be unigrams or n-grams concatenated by an _ (underscore) character.

  • stopwords is a character vector. Right now I am content for these to be fixed strings, but it would be a nice bonus to be able to implement this using regular-expression-formatted stopwords as well.

Desired output: a list of character vectors matching the input tokens, but with any component matching a stopword removed. (This means either a unigram match, or a match to one of the terms that make up the n-gram.)

Examples, test data, working code, and benchmarks to build on:

tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"),
                text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
                text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")

# remove any single token that matches a stopword
removeTokensOP1 <- function(w, stopwords) {
  lapply(w, function(x) x[-which(x %in% stopwords)])
}

# remove any word pair where a single word contains a stopword
removeTokensOP2 <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  lapply(w, function(x) x[-grep(matchPattern, x)])
}

removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
##
## $text2
## [1] "some"  "more"  "words" "test"  "text"

removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
##
## $text2
## [1] "some"  "more"  "words" "test"  "text"

removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
##
## $text2
## [1] "some_more"  "more_words" "text_text"

removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
##
## $text2
## [1] "some_more_words"

# performance benchmarks for answers to build on
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens1, stopwords),
               OP2_2 = removeTokensOP2(tokens2, stopwords),
               OP2_3 = removeTokensOP2(tokens3, stopwords),
               unit = "relative")
## Unit: relative
##  expr      min       lq     mean   median       uq      max neval
## OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
## OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351   100
## OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629   100
## OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244   100
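One caveat on the reference implementations above (not exercised by the benchmark, since every test vector contains at least one stopword match): negative indexing with -which(...) or -grep(...) silently drops everything when there are no matches, so general-purpose use needs a length guard. A minimal illustration:

x <- c("test_text", "few_words")
idx <- grep("(^|_)zzz(_|$)", x)      # no matches -> integer(0)
x[-idx]                              # character(0): everything is dropped
if (length(idx)) x[-idx] else x      # keeps x unchanged when nothing matches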


We can improve on the lapply, if your list has many elements, by using the parallel package.

Create many elements

tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens2 <- lapply(1:500, function(x) sample(tokens2, 1)[[1]])

We do this because the parallel package has a lot of setup overhead, so simply increasing the number of iterations in microbenchmark would keep incurring that cost. By increasing the size of the list instead, you see the true improvement.

library(parallel)
library(microbenchmark)

# Setup
cl <- detectCores()
cl <- makeCluster(cl)

# Two functions:
# original
removeTokensOP2 <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  lapply(w, function(x) x[-grep(matchPattern, x)])
}

# new
removeTokensOPP <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  return(w[-grep(matchPattern, w)])
}

# compare
microbenchmark(
  OP2_P = parLapply(cl, tokens2, removeTokensOPP, stopwords),
  OP2_2 = removeTokensOP2(tokens2, stopwords),
  unit = "relative"
)

Unit: relative
  expr      min       lq     mean   median       uq      max neval
 OP2_P 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
 OP2_2 1.730565 1.653872 1.678781 1.562258 1.471347 10.11306   100

As the number of elements in your list grows, the gain from parallelization will improve.
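To see that effect, one option is to re-run the same comparison on a larger list (a sketch reusing the cl, removeTokensOPP, removeTokensOP2, and stopwords objects defined above; tokens_big is just an illustrative name), and to shut the cluster down once you are done:

# scale the toy list further and re-run the same comparison
tokens_big <- lapply(1:5000, function(x) sample(tokens2, 1)[[1]])
microbenchmark(
  OP2_P = parLapply(cl, tokens_big, removeTokensOPP, stopwords),
  OP2_2 = removeTokensOP2(tokens_big, stopwords),
  unit = "relative"
)

# release the worker processes when finished
stopCluster(cl)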


You could consider simplifying your regular expressions; all those ^ and $ anchors add to the overhead.

remove_short <- function(x, stopwords) {
  stopwords_regexp <- paste0("(^|_)(", paste(stopwords, collapse = "|"), ")(_|$)")
  lapply(x, function(x) x[!grepl(stopwords_regexp, x)])
}

require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens2, stopwords),
               OP2_2 = remove_short(tokens2, stopwords),
               unit = "relative")
Unit: relative
  expr      min       lq     mean   median       uq      max neval cld
 OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100 a
 OP2_1 5.178565 4.768749 4.465138 4.441130 4.262399 4.266905   100   c
 OP2_2 3.452386 3.247279 3.063660 3.068571 2.963794 2.948189   100  b
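If you want to squeeze a little more out of the regex route, a further (unbenchmarked) variant to try is asking grepl for the PCRE engine with perl = TRUE, which on some platforms handles alternation patterns like this one faster; a sketch:

remove_short_perl <- function(x, stopwords) {
  stopwords_regexp <- paste0("(^|_)(", paste(stopwords, collapse = "|"), ")(_|$)")
  # same pattern as remove_short, evaluated with the PCRE engine
  lapply(x, function(x) x[!grepl(stopwords_regexp, x, perl = TRUE)])
}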