How do I use sklearn CountVectorizer with both the 'word' and 'char' analyzers? - python

CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

I can extract text features by word or by character separately, but how do I create a charword_vectorizer? Is there a way to combine the two vectorizers, or to use more than one analyzer?

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
    >>> char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=1)
    >>> x = ['this is a foo bar', 'you are a foo bar black sheep']
    >>> word_vectorizer.fit_transform(x)
    <2x15 sparse matrix of type '<type 'numpy.int64'>'
        with 18 stored elements in Compressed Sparse Column format>
    >>> char_vectorizer.fit_transform(x)
    <2x47 sparse matrix of type '<type 'numpy.int64'>'
        with 64 stored elements in Compressed Sparse Column format>
    >>> # get_feature_names() was removed in scikit-learn 1.2;
    >>> # newer versions use get_feature_names_out() instead.
    >>> char_vectorizer.get_feature_names()
    [u' ', u' a', u' b', u' f', u' i', u' s', u'a', u'a ', u'ac', u'ar',
     u'b', u'ba', u'bl', u'c', u'ck', u'e', u'e ', u'ee', u'ep', u'f',
     u'fo', u'h', u'he', u'hi', u'i', u'is', u'k', u'k ', u'l', u'la',
     u'o', u'o ', u'oo', u'ou', u'p', u'r', u'r ', u're', u's', u's ',
     u'sh', u't', u'th', u'u', u'u ', u'y', u'yo']
    >>> word_vectorizer.get_feature_names()
    [u'are', u'are foo', u'bar', u'bar black', u'black', u'black sheep',
     u'foo', u'foo bar', u'is', u'is foo', u'sheep', u'this', u'this is',
     u'you', u'you are']



You can pass a callable as the analyzer argument to get full control over tokenization, for example:

    >>> from pprint import pprint
    >>> import re
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> x = ['this is a foo bar', 'you are a foo bar black sheep']
    >>> def words_and_char_bigrams(text):
    ...     # Words of three or more characters
    ...     words = re.findall(r'\w{3,}', text)
    ...     for w in words:
    ...         yield w
    ...         # Note: len(w) - 2 leaves out each word's final bigram;
    ...         # use len(w) - 1 to include it.
    ...         for i in range(len(w) - 2):
    ...             yield w[i:i+2]
    ...
    >>> v = CountVectorizer(analyzer=words_and_char_bigrams)
    >>> pprint(v.fit(x).vocabulary_)
    {'ac': 0,
     'ar': 1,
     'are': 2,
     'ba': 3,
     'bar': 4,
     'bl': 5,
     'black': 6,
     'ee': 7,
     'fo': 8,
     'foo': 9,
     'he': 10,
     'hi': 11,
     'la': 12,
     'sh': 13,
     'sheep': 14,
     'th': 15,
     'this': 16,
     'yo': 17,
     'you': 18}
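
The question also asks whether the outputs of the two vectorizers can simply be combined. One possible way, offered here only as a sketch and not as part of the answer above, is scikit-learn's FeatureUnion, which fits several transformers on the same input and stacks their outputs column-wise. The step names "word" and "char" and the variable names docs and X are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["this is a foo bar", "you are a foo bar black sheep"]

# Same settings as the two vectorizers in the question.
word_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2), min_df=1)
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2), min_df=1)

# FeatureUnion fits each vectorizer on the same documents and stacks the
# resulting sparse matrices side by side (word columns first, then char).
charword_vectorizer = FeatureUnion([
    ("word", word_vectorizer),
    ("char", char_vectorizer),
])

X = charword_vectorizer.fit_transform(docs)

# Look up the fitted vectorizers by the (illustrative) step names.
fitted = dict(charword_vectorizer.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

print(X.shape, n_word, n_char)
```

Going by the feature counts reported in the question (15 word features and 47 char features), the combined matrix should have 62 columns. Compared with a callable analyzer, this keeps the word and char features in separate column blocks, at the cost of tokenizing each document twice.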