python - read - Create a Pandas DataFrame from a txt file with a specific pattern
First, you can use read_csv with the names parameter to create a DataFrame with a single column, Region Name. The separator must be a value that does NOT occur in the data (such as ;):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert a new State column, built with extract from the rows that contain the text [edit] and forward-filled, and use replace to remove everything from ( to the end of each value in the Region Name column.
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)
Finally, remove the rows that contain the text [edit] using boolean indexing; the mask is created by str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
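The three steps above can be tried end to end on a small in-memory sample. This is only a sketch: the sample string is a hypothetical stand-in for filename.txt, and it assumes a pandas version where str.replace accepts the regex=True argument.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample standing in for filename.txt
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arkansas[edit]"""

# ';' never occurs in the data, so each line lands in a single column
df = pd.read_csv(StringIO(sample), sep=";", names=["Region Name"])

# State names appear only on the "[edit]" rows; ffill copies them downward
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop everything from " (" to the end of the region text
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)

# Remove the state header rows themselves
df = df[~df["Region Name"].str.contains(r"\[edit\]")].reset_index(drop=True)
print(df)
```

A state with no regions under it (Arkansas in this sample) produces no rows, because only its header line exists and that line is dropped in the last step.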
If you need to keep the full values, the solution is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
I need to create a Pandas DataFrame from a text file with the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows with "[edit]" are states and the rows with [number] are regions. I need to split these and repeat the state name for each region name that follows it.
Index State Region Name
0 Alabama Auburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
9 Alaska Fairbanks...
10 Alaska Arizona...
11 Alaska Flagstaff...
I am not sure how to split the text file on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the state name for each region name. Could anyone give me a starting point for accomplishing this?
You can first parse the file into tuples:
import pandas as pd
from collections import namedtuple
Item = namedtuple('Item', 'state area')
items = []
marker = '[edit]'
with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith(marker):
            # slice the marker off; rstrip('[edit]') strips a set of characters, not a suffix
            state = l[:-len(marker)]
        else:
            # the region name is everything before the first " ("
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))
df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
Output:
State Area
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
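One caveat when removing the [edit] marker with plain string methods: str.rstrip takes a set of characters, not a suffix, so it keeps stripping as long as the trailing character is one of [, e, d, i, t or ]. The sample output above is unaffected, but a state name that ends in one of those letters would be truncated. The short sketch below shows the effect; strip_marker is a hypothetical helper name, not part of the answer.

```python
# str.rstrip treats its argument as a set of characters, not a suffix
print("Alabama[edit]".rstrip("[edit]"))   # Alabama
print("Delaware[edit]".rstrip("[edit]"))  # Delawar (the trailing "e" is stripped too)


def strip_marker(line, marker="[edit]"):
    """Return line without a trailing marker (hypothetical helper)."""
    return line[:-len(marker)] if line.endswith(marker) else line


print(strip_marker("Delaware[edit]"))  # Delaware
```

On Python 3.9 and later, str.removesuffix does the same job as the helper.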
You will probably need to do some additional manipulation of the file before putting it into a DataFrame.
A starting point would be to split the file into lines, look for the string [edit] in each line, and use the state name as a dictionary key when it is present ...
I don't think Pandas has any built-in method that can handle a file in this format.
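The dictionary idea above might be sketched as follows. The answer gives only the outline, so the details here are assumptions: the sample lines are hypothetical, and regions_by_state and current are illustrative names.

```python
import pandas as pd

# Hypothetical sample lines; in practice these would be read from the file
lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "Arkansas[edit]",
]

marker = "[edit]"
regions_by_state = {}  # state name -> list of region lines
current = None
for line in lines:
    if line.endswith(marker):
        # A state header starts a new dictionary entry
        current = line[:-len(marker)]
        regions_by_state[current] = []
    elif current is not None:
        # Any other line belongs to the most recent state
        regions_by_state[current].append(line)

# Flatten the dictionary into (state, region) rows
rows = [(state, region)
        for state, regions in regions_by_state.items()
        for region in regions]
df = pd.DataFrame(rows, columns=["State", "Region Name"])
print(df)
```

Unlike the DataFrame-only approaches, the dictionary keeps states that have no regions (Arkansas here maps to an empty list), which may or may not be wanted.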
Assuming you have the following DataFrame:
In [73]: df
Out[73]:
text
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
you can use the extract method:
In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
In [120]: df.State = df.State.ffill()
In [121]: df
Out[121]:
text State Region Name
0 Alabama[edit] Alabama NaN
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
9 Alaska[edit] Alaska NaN
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
11 Arizona[edit] Arizona NaN
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
15 Arkansas[edit] Arkansas NaN
In [122]: df = df.dropna()
In [123]: df
Out[123]:
text State Region Name
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
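To see what the two patterns capture, they can be tried on individual lines with the standard re module. This is a sketch using the patterns as intended in the extract calls (with backslash escapes); state_re and region_re are illustrative names.

```python
import re

state_re = re.compile(r"(.*?)\[edit\]")
region_re = re.compile(r"(.*?)\s*[\(\[]+.*[\n]*")

samples = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Tempe (Arizona State University)",
]
for line in samples:
    s = state_re.search(line)
    r = region_re.search(line)
    print(repr(line), "state:", s and s.group(1), "region:", r and r.group(1))
```

The region pattern also matches a state header line, since that line contains a [ as well. That is why the answer applies it only to rows where State is still null.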
TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]
regex = r'(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # will get nulls where we don't have "[edit]"
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()
).apply(
    # I still have all the original values
    # If I group by the forward filled rows
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])
State Region_Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
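The key step in this answer is the forward fill of the extracted state names. The sketch below isolates that step on a four-line Series and then, as an alternative that is not the answer's method, selects the region rows with a boolean mask instead of groupby and tail. The names keys, filled and out are illustrative.

```python
import pandas as pd

s = pd.Series([
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
])

# NaN wherever the line has no "[edit]" marker
keys = s.str.extract(r"(?P<State>.*?)\[edit\]", expand=False)
# Forward fill gives every line the most recent state name
filled = keys.ffill()
print(filled.tolist())

# Alternative to groupby/tail (an assumption, not the answer's method):
# keep only the rows where the extract found nothing
is_region = keys.isna()
out = pd.DataFrame({"State": filled[is_region], "Region_Name": s[is_region]})
out = out.reset_index(drop=True)
print(out)
```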
Setup
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""
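from io import StringIO
# squeeze=True was removed from read_csv in pandas 2.0; squeeze the single column instead
s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')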