
DocumentTermMatrix error on Corpus argument

I have the following code:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I run DocumentTermMatrix(), I get this error:

Error: inherits(doc, "TextDocument") is not TRUE

Why do I get this error? Are my rows not text documents?

Here is the output from inspecting corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

Edit: Here is my data:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods

Which I read in with

news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

Edit: Here is the traceback():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), the result is FALSE.

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

To this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

This should work.
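As a minimal sketch of how that change fits into the pipeline from the question: the headlines vector below is a hypothetical stand-in for news_raw$text, and VCorpus is used explicitly (an assumption, so the behaviour does not depend on what Corpus() returns in a given tm version). The custom trim helper is wrapped the same way, since it is not one of tm's built-in transformations either.

```r
library(tm)

# Hypothetical stand-in for news_raw$text from the question.
headlines <- c("  The week in 32 photos ", "Look at me! 22 selfies of the week")

# Same helper as in the question: strip leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- VCorpus(VectorSource(headlines))

# Plain character functions (tolower, trim) are wrapped in content_transformer()
# so that each document stays a TextDocument; tm's own transformations are not.
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))

news_dtm <- DocumentTermMatrix(corpus_clean)
```

Calling inspect(news_dtm) afterwards is one way to confirm that the matrix was built.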

# Reinstall tm 0.5-10 from the CRAN Windows binary built for R 3.0
remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip", repos = NULL)
library(tm)

I found a way to solve this problem in an article about tm.

An example that produces the error follows:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)

Warning message:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This error occurs because you need an object of class VectorSource to build your term-document matrix, but the preceding transformations turn your corpus of texts into character vectors, which changes it to a class the function does not accept.

However, if you add the content_transformer function inside the tm_map call, you may not need any further command before calling TermDocumentMatrix.

The following code changes the class (see the second-to-last line) and avoids the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class
matrix_term <- DocumentTermMatrix(corpus)

It looks like this would have worked fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). Instead they return character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of character vectors.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not listed in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument form, which should keep DocumentTermMatrix happy.
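To see the difference between the two options, here is a small sketch that checks the class of every document after each kind of mapping. The two-document corpus and the all_text_docs helper are hypothetical, introduced only for illustration, and the sketch assumes tm 0.6.0 or later with an explicit VCorpus.

```r
library(tm)

# Hypothetical two-document corpus for illustration.
corpus <- VCorpus(VectorSource(c("Obama Holds Meeting", "Oil Boom Produces Jobs")))

# Hypothetical helper: TRUE only if every element is still a TextDocument.
all_text_docs <- function(corp) {
  all(vapply(content(corp), inherits, logical(1), what = "TextDocument"))
}

# Bare tolower: each document is replaced by a plain character vector.
bare <- tm_map(corpus, tolower)

# content_transformer(tolower): only the content changes, the class is kept.
wrapped <- tm_map(corpus, content_transformer(tolower))

# PlainTextDocument applied afterwards wraps the character vectors again.
rewrapped <- tm_map(bare, PlainTextDocument)

all_text_docs(bare)
all_text_docs(wrapped)
all_text_docs(rewrapped)
```

Under those assumptions, the first check should come back FALSE and the other two TRUE, which matches the inherits(doc, "TextDocument") failure reported in the question.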