txt read leer importar datos como columnas cargar carga archivos archivo python regex pandas text extract

python - read - Cree Pandas DataFrame desde un archivo txt con un patrón específico



leer columnas en python (5)

Primero puede read_csv con el name parámetro para crear DataFrame con la columna Region Name , el separador es un valor que NO está en valores (como ; ):

df = pd.read_csv(''filename.txt'', sep=";", names=[''Region Name''])

Luego, insert nueva columna con las filas de extract donde el texto [edit] y replace todos los valores desde ( hasta el final de la columna Region Name .

df.insert(0, ''State'', df[''Region Name''].str.extract(''(.*)/[edit/]'', expand=False).ffill()) df[''Region Name''] = df[''Region Name''].str.replace(r'' /(.+$'', '''')

Última eliminar filas donde el texto [edit] mediante boolean indexing , la máscara es creada por str.contains :

df = df[~df[''Region Name''].str.contains(''/[edit/]'')].reset_index(drop=True) print (df) State Region Name 0 Alabama Auburn 1 Alabama Florence 2 Alabama Jacksonville 3 Alabama Livingston 4 Alabama Montevallo 5 Alabama Troy 6 Alabama Tuscaloosa 7 Alabama Tuskegee 8 Alaska Fairbanks 9 Arizona Flagstaff 10 Arizona Tempe 11 Arizona Tucson

Si es necesario, la solución de todos los valores es más fácil:

df = pd.read_csv(''filename.txt'', sep=";", names=[''Region Name'']) df.insert(0, ''State'', df[''Region Name''].str.extract(''(.*)/[edit/]'', expand=False).ffill()) df = df[~df[''Region Name''].str.contains(''/[edit/]'')].reset_index(drop=True) print (df) State Region Name 0 Alabama Auburn (Auburn University)[1] 1 Alabama Florence (University of North Alabama) 2 Alabama Jacksonville (Jacksonville State University)[2] 3 Alabama Livingston (University of West Alabama)[2] 4 Alabama Montevallo (University of Montevallo)[2] 5 Alabama Troy (Troy University)[2] 6 Alabama Tuscaloosa (University of Alabama, Stillman Co... 7 Alabama Tuskegee (Tuskegee University)[5] 8 Alaska Fairbanks (University of Alaska Fairbanks)[2] 9 Arizona Flagstaff (Northern Arizona University)[6] 10 Arizona Tempe (Arizona State University) 11 Arizona Tucson (University of Arizona)

Necesito crear un Pandas DataFrame basado en un archivo de texto basado en la siguiente estructura:

Alabama[edit] Auburn (Auburn University)[1] Florence (University of North Alabama) Jacksonville (Jacksonville State University)[2] Livingston (University of West Alabama)[2] Montevallo (University of Montevallo)[2] Troy (Troy University)[2] Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] Tuskegee (Tuskegee University)[5] Alaska[edit] Fairbanks (University of Alaska Fairbanks)[2] Arizona[edit] Flagstaff (Northern Arizona University)[6] Tempe (Arizona State University) Tucson (University of Arizona) Arkansas[edit]

Las filas con "[editar]" son Estados y las filas [número] son ​​Regiones. Necesito dividir lo siguiente y repetir el nombre del Estado para cada Nombre de Región a partir de entonces.

Index State Region Name 0 Alabama Aurburn... 1 Alabama Florence... 2 Alabama Jacksonville... ... 9 Alaska Fairbanks... 10 Alaska Arizona... 11 Alaska Flagstaff...

Pandas DataFrame

No estoy seguro de cómo dividir el archivo de texto basado en "[editar]" y "[número]" o "(caracteres)" en las columnas respectivas y repetir el nombre del estado para cada nombre de región. Por favor, ¿alguien puede darme un punto de partida para lograr lo siguiente?


Primero puede analizar el archivo en tuplas:

import pandas as pd from collections import namedtuple Item = namedtuple(''Item'', ''state area'') items = [] with open(''unis.txt'') as f: for line in f: l = line.rstrip(''/n'') if l.endswith(''[edit]''): state = l.rstrip(''[edit]'') else: i = l.index('' ('') area = l[:i] items.append(Item(state, area)) df = pd.DataFrame.from_records(items, columns=[''State'', ''Area'']) print df

salida:

State Area 0 Alabama Auburn 1 Alabama Florence 2 Alabama Jacksonville 3 Alabama Livingston 4 Alabama Montevallo 5 Alabama Troy 6 Alabama Tuscaloosa 7 Alabama Tuskegee 8 Alaska Fairbanks 9 Arizona Flagstaff 10 Arizona Tempe 11 Arizona Tucson


Probablemente necesitará realizar alguna manipulación adicional en el archivo antes de colocarlo en un marco de datos.

Un punto de partida sería dividir el archivo en líneas, buscar la cadena [edit] en cada línea, poner el nombre de la cadena como la clave de un diccionario cuando esté allí ...

No creo que Pandas haya incorporado ningún método que pueda manejar un archivo en este formato.


Suponiendo que tiene el siguiente DF:

In [73]: df Out[73]: text 0 Alabama[edit] 1 Auburn (Auburn University)[1] 2 Florence (University of North Alabama) 3 Jacksonville (Jacksonville State University)[2] 4 Livingston (University of West Alabama)[2] 5 Montevallo (University of Montevallo)[2] 6 Troy (Troy University)[2] 7 Tuscaloosa (University of Alabama, Stillman Co... 8 Tuskegee (Tuskegee University)[5] 9 Alaska[edit] 10 Fairbanks (University of Alaska Fairbanks)[2] 11 Arizona[edit] 12 Flagstaff (Northern Arizona University)[6] 13 Tempe (Arizona State University) 14 Tucson (University of Arizona) 15 Arkansas[edit]

puede usar el método extract :

In [117]: df[''State''] = df.loc[df.text.str.contains(''[edit]'', regex=False), ''text''].str.extract(r''(.*?)/[edit/]'', expand=False) In [118]: df[''Region Name''] = df.loc[df.State.isnull(), ''text''].str.extract(r''(.*?)/s*[/(/[]+.*[/n]*'', expand=False) In [120]: df.State = df.State.ffill() In [121]: df Out[121]: text State Region Name 0 Alabama[edit] Alabama NaN 1 Auburn (Auburn University)[1] Alabama Auburn 2 Florence (University of North Alabama) Alabama Florence 3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville 4 Livingston (University of West Alabama)[2] Alabama Livingston 5 Montevallo (University of Montevallo)[2] Alabama Montevallo 6 Troy (Troy University)[2] Alabama Troy 7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa 8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee 9 Alaska[edit] Alaska NaN 10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks 11 Arizona[edit] Arizona NaN 12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff 13 Tempe (Arizona State University) Arizona Tempe 14 Tucson (University of Arizona) Arizona Tucson 15 Arkansas[edit] Arkansas NaN In [122]: df = df.dropna() In [123]: df Out[123]: text State Region Name 1 Auburn (Auburn University)[1] Alabama Auburn 2 Florence (University of North Alabama) Alabama Florence 3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville 4 Livingston (University of West Alabama)[2] Alabama Livingston 5 Montevallo (University of Montevallo)[2] Alabama Montevallo 6 Troy (Troy University)[2] Alabama Troy 7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa 8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee 10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks 12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff 13 Tempe (Arizona State University) Arizona Tempe 14 Tucson (University of Arizona) Arizona Tucson


TL; DR
s.groupby(s.str.extract(''(?P<State>.*?)/[edit/]'', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name=''Region_Name'').iloc[:, [0, 2]]

regex = ''(?P<State>.*?)/[edit/]'' # pattern to match print(s.groupby( # will get nulls where we don''t have "[edit]" # forward fill fills in the most recent line # where we did have an "[edit]" s.str.extract(regex, expand=False).ffill() ).apply( # I still have all the original values # If I group by the forward filled rows # I''ll want to drop the first one within each group pd.Series.tail, n=-1 ).reset_index( # munge the dataframe to get columns sorted name=''Region_Name'' )[[''State'', ''Region_Name'']]) State Region_Name 0 Alabama Auburn (Auburn University)[1] 1 Alabama Florence (University of North Alabama) 2 Alabama Jacksonville (Jacksonville State University)[2] 3 Alabama Livingston (University of West Alabama)[2] 4 Alabama Montevallo (University of Montevallo)[2] 5 Alabama Troy (Troy University)[2] 6 Alabama Tuscaloosa (University of Alabama, Stillman Co... 7 Alabama Tuskegee (Tuskegee University)[5] 8 Alaska Fairbanks (University of Alaska Fairbanks)[2] 9 Arizona Flagstaff (Northern Arizona University)[6] 10 Arizona Tempe (Arizona State University) 11 Arizona Tucson (University of Arizona)

preparar

txt = """Alabama[edit] Auburn (Auburn University)[1] Florence (University of North Alabama) Jacksonville (Jacksonville State University)[2] Livingston (University of West Alabama)[2] Montevallo (University of Montevallo)[2] Troy (Troy University)[2] Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] Tuskegee (Tuskegee University)[5] Alaska[edit] Fairbanks (University of Alaska Fairbanks)[2] Arizona[edit] Flagstaff (Northern Arizona University)[6] Tempe (Arizona State University) Tucson (University of Arizona) Arkansas[edit]""" s = pd.read_csv(StringIO(txt), sep=''|'', header=None, squeeze=True)