Expanding English language contractions in Python
Even though this is an old question, I figured I might as well answer it, since as far as I can see there is no real solution for this.
I had to work on this in a related NLP project, and I decided to tackle the problem since there did not seem to be anything here. You can check out my expander repository on GitHub if you are interested.
It is a fairly badly optimized (I think) program based on NLTK and the Stanford CoreNLP models, which you will have to download separately. All the necessary information should be in the README file and in the heavily commented code. I know commented code is dead code, but that is how I write to keep things clear for myself.
The example input in expander.py is the following set of sentences:
["I won't let you get away with that", # won't -> will not
"I'm a bad person", # 'm -> am
"It's his cat anyway", # 's -> is
"It's not what you think", # 's -> is
"It's a man's world", # 's -> is and 's possessive
"Catherine's been thinking about it", # 's -> has
"It'll be done", # 'll -> will
"Who'd've thought!", # 'd -> would, 've -> have
"She said she'd go.", # she'd -> she would
"She said she'd gone.", # she'd -> had
"Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
" My name is Jack.", # No replacements.
"'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
"As history tells, 'twas the night before Christmas.", # 'Twas -> It was
"Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have
for which the output is
["I will not let you get away with that",
"I am a bad person",
"It is his cat anyway",
"It is not what you think",
"It is a man's world",
"Catherine has been thinking about it",
"It will be done",
"Who would have thought!",
"She said she would go.",
"She said she had gone.",
"You all would have a great time, would not it be so cold!",
"My name is Jack.",
"It is questionable whether Madam should be going.",
"As history tells, it was the night before Christmas.",
"Martha, Peter and Christine have been indulging in a menage-à-trois."]
So for this small set of test sentences, which I came up with to test some edge cases, it works well.
Since this project has lost importance for now, I am no longer actively developing it. Any help with it would be appreciated. The things still to be done are written in the TODO list. And if you have any tips on how to improve my Python, I would also be very grateful.
The English language has a couple of contractions. For instance:
you've -> you have
he's -> he is
These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?
This is a very cool and easy-to-use library for the purpose: https://pypi.python.org/pypi/pycontractions/1.0.1 .
Example of use (detailed in the link):
from pycontractions import Contractions
# Load your favorite word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')
# optional, prevents loading on first expand_texts call
cont.load_models()
out = list(cont.expand_texts(["I'd like to know how I'd done that!",
                              "We're going to the zoo and I don't think I'll be home for dinner.",
                              "Theyre going to the zoo and she'll be home for dinner."], precise=True))
print(out)
You will also need GoogleNews-vectors-negative300.bin; the download link is on the pycontractions page linked above. The example code is for Python 3.
I turned the contraction-to-expansion page from Wikipedia into a Python dictionary (see below).
Note that, as you might expect, you will definitely want to use double quotes when querying the dictionary, because the keys contain apostrophes (for example, contractions["ain't"]).
Also, I have left in multiple options, as on the Wikipedia page. Feel free to modify it as you wish. Note that disambiguating to the correct expansion would be a tricky problem.
contractions = {
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}
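Since many values hold several options separated by " / ", one way to work with the dictionary is to split each value into a list of candidate expansions and leave the choice to the caller. The following is only a minimal sketch: SAMPLE holds a few entries copied from the dictionary, and the function name candidates is made up for illustration.

```python
# A few entries copied from the dictionary; the full table works the same way.
SAMPLE = {
    "he's": "he has / he is",
    "won't": "will not",
    "I'd": "I had / I would",
}

def candidates(contraction, table=SAMPLE):
    """Return every possible expansion for a contraction, or [] if unknown."""
    value = table.get(contraction)
    if value is None:
        # Most keys are lower case, so retry with a lower-cased key.
        value = table.get(contraction.lower())
    if value is None:
        return []
    # "he has / he is" -> ["he has", "he is"]
    return [option.strip() for option in value.split("/")]

print(candidates("he's"))   # ['he has', 'he is']
print(candidates("won't"))  # ['will not']
print(candidates("He's"))   # ['he has', 'he is']
```

Choosing between the candidates (for example "he has" versus "he is") still needs context, which is the hard part mentioned above.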
The answers above will work perfectly well and may be better for ambiguous contractions (although I would argue that there are not that many ambiguous cases). I would use something more readable and easier to maintain:
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.
It may have some flaws that I did not think of.
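To make a few such flaws concrete, here is a small sketch that reruns the same substitutions on inputs chosen for this illustration (they are not from the original answer). It suggests that possessives, "'d" meaning "had", and capitalised forms of the specific rules are not handled.

```python
import re

def decontracted(phrase):
    # Same substitutions as in the answer, written with plain apostrophes.
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'t", " not", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

# The possessive 's is rewritten as "is".
print(decontracted("It's a man's world"))    # It is a man is world
# 'd always becomes "would", even where "had" is meant.
print(decontracted("She said she'd gone."))  # She said she would gone.
# The specific rules are case sensitive, so "Won't" falls through to n't.
print(decontracted("Won't you come?"))       # Wo not you come?
```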
Reposted from my other answer
I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is, as mentioned, fewer than 100. Granted, in a real scenario the number could be higher. Even so, I am fairly sure that 200-300 words are all you will ever have for English contractions. Now, do you want a separate library for those (although I do not think what you are looking for actually exists)? You can easily solve this problem with a dictionary and regular expressions. I would recommend using a good tokenizer, such as the ones in the Natural Language Toolkit, and the rest you should have no trouble implementing yourself.
You do not need a library; it can be done with a regular expression, for example.
>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
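The same idea can be hardened a little. The sketch below is one possible variant, not the answer's own code: the table CONTRACTIONS is a made-up sample, keys are escaped with re.escape, longer keys are tried first so that "couldn't've" is not cut short at "couldn't", and matching ignores case while keeping a leading capital.

```python
import re

# Hypothetical sample table; extend it with the entries you need.
CONTRACTIONS = {
    "don't": "do not",
    "didn't": "did not",
    "couldn't": "could not",
    "couldn't've": "could not have",
}

# Longest keys first so the longer contraction wins in the alternation;
# re.escape protects any regex metacharacters that a key might contain.
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. "Don't" -> "Do not".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(replace, text)

print(expand_contractions("Don't worry, you couldn't've known."))
# Do not worry, you could not have known.
```

Ambiguous entries such as "he's" are deliberately left out of this sample, since a plain lookup cannot decide between "he has" and "he is".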