tensorflow - TensorFlow VocabularyProcessor

I am following the WildML blog post on text classification using TensorFlow. I cannot understand the purpose of max_document_length in this statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

Also, how can I extract the vocabulary from vocab_processor?

not able to understand the purpose of max_document_length

The VocabularyProcessor maps your text documents into vectors of word ids, and it needs these vectors to have a consistent length.

Your input records may not (and probably will not) all have the same length. For example, if you are working with sentences for sentiment analysis, they will have varying lengths.

You provide this parameter to the VocabularyProcessor so that it can fix the length of the output vectors. According to the documentation,

max_document_length: Maximum length of documents. If documents are longer, they will be trimmed; if shorter, padded.
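The trim-or-pad rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior, not the library's code: the helper name pad_or_trim and the choice of 0 as the padding id are assumptions made for the example.

```python
def pad_or_trim(word_ids, max_document_length, pad_id=0):
    """Return a list of exactly max_document_length ids (hypothetical helper)."""
    # Drop everything past the maximum length.
    trimmed = list(word_ids[:max_document_length])
    # Fill the remainder, if any, with the padding id.
    padding = [pad_id] * (max_document_length - len(trimmed))
    return trimmed + padding


short_doc = [4, 7, 2]               # shorter than the limit: padded
long_doc = [4, 7, 2, 9, 1, 3, 8]    # longer than the limit: trimmed

print(pad_or_trim(short_doc, 5))  # [4, 7, 2, 0, 0]
print(pad_or_trim(long_doc, 5))   # [4, 7, 2, 9, 1]
```

Either way, every document ends up as a vector of the same length, which is what lets the results be stacked into a single matrix.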

Take a look at the source code.

def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids

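Note the line word_ids = np.zeros(self.max_document_length, np.int64). The vector starts out as all zeros, so positions beyond the end of a short document stay 0 (padding), and the break in the inner loop drops any tokens beyond max_document_length (trimming).

Each row in raw_documents will be mapped to a vector of length max_document_length.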
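To make the mapping concrete without depending on tf.contrib, here is a simplified stand-in for the fit and transform steps. It is a sketch under stated assumptions, not the library implementation: fit_vocabulary and transform are hypothetical names, tokens are split on whitespace rather than with the library's tokenizer, and id 0 is assumed to be reserved for an "<UNK>" token that also serves as padding.

```python
def fit_vocabulary(documents):
    """Assign ids in order of first appearance (hypothetical helper).

    Id 0 is reserved for unknown words and doubles as the padding value.
    """
    mapping = {"<UNK>": 0}
    for doc in documents:
        for token in doc.split():
            if token not in mapping:
                mapping[token] = len(mapping)
    return mapping


def transform(documents, mapping, max_document_length):
    """Map each document to a list of exactly max_document_length ids."""
    rows = []
    for doc in documents:
        row = [0] * max_document_length          # start fully padded
        for idx, token in enumerate(doc.split()):
            if idx >= max_document_length:       # trim long documents
                break
            row[idx] = mapping.get(token, 0)     # unseen words map to 0
        rows.append(row)
    return rows


docs = ["This is a cat", "This must be boy", "This is a a dog"]
mapping = fit_vocabulary(docs)
rows = transform(docs, mapping, max_document_length=4)

for row in rows:
    print(row)
# [1, 2, 3, 4]
# [1, 5, 6, 7]
# [1, 2, 3, 3]   <- "dog" is cut off because the limit is 4
```

With max_document_length=4 the third sentence loses its last word, which is why the question's author would normally pick a limit at least as large as the longest document they care about.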

I figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.

import numpy as np
from tensorflow.contrib import learn

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))

## Extract the word:id mapping from the object.
## Note: _mapping is a private attribute, so it may change between versions.
vocab_dict = vocab_processor.vocabulary_._mapping

## Sort the vocabulary dictionary on the basis of values (ids).
## Both statements perform the same task (the first one needs "import operator").
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

## Treat the ids as indexes into a list and create a list of words in ascending order of id:
## the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])

print(vocabulary)
print(x)
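Once the vocabulary is a list ordered by id, it can also be used to turn a row of ids back into words. The sketch below shows the idea with a small hand-written mapping; the dictionary contents and the decode helper are hypothetical and stand in for whatever vocab_dict holds in your own run.

```python
# Hypothetical word:id mapping of the kind stored in vocab_dict.
vocab_dict = {"cat": 4, "<UNK>": 0, "is": 2, "This": 1, "a": 3}

# Sort by id so that the word with id i lands at index i.
vocabulary = [word for word, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]


def decode(row, vocabulary):
    """Look up each id in the id-ordered vocabulary list (hypothetical helper)."""
    return [vocabulary[word_id] for word_id in row]


print(vocabulary)
# ['<UNK>', 'This', 'is', 'a', 'cat']

print(decode([1, 2, 3, 4, 0], vocabulary))
# ['This', 'is', 'a', 'cat', '<UNK>']
```

This only works if the ids are contiguous and start at 0, which is why the list is built from the sorted mapping rather than from the dictionary's own ordering.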