The most efficient way to use pandas pivot table on large files

Syntax error

Your code has a number of syntactic errors.

pool.submit(myfunc(folder), 1000)

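The pool.submit method takes a callable as its first argument, followed by the arguments to pass to that callable.

Writing myfunc(folder) calls the function immediately and hands its return value to submit. From what I can see, myfunc returns nothing, so what you pass is None, which is definitely not a function.

Even so, as I understand it, you are trying to start 1000 workers, all of which read the same folder and then build dataframes.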
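To illustrate the point about submit, here is a minimal sketch. The function process_folder and the string "logs_a" are hypothetical stand-ins for myfunc and a real folder path; only ThreadPoolExecutor and submit come from the question.

```python
from concurrent.futures import ThreadPoolExecutor


def process_folder(folder):
    """Hypothetical stand-in for myfunc: returns a value derived from its input."""
    return folder.upper()


with ThreadPoolExecutor(max_workers=2) as pool:
    # Pass the callable and its arguments separately; the pool calls it in a worker.
    future = pool.submit(process_folder, "logs_a")
    # pool.submit(process_folder("logs_a")) would instead run the function right
    # here, in the calling thread, and submit its return value as the "callable".
    result = future.result()  # blocks until done and re-raises any worker exception
```

Calling future.result() also surfaces exceptions raised inside the worker, which a try/except around submit alone would not catch.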

Parallelization problem

In a CPU-bound scenario like this one, the number of workers should be close to the number of cores available on the machine you are running on. This is common sense, so I will not cite anything.

Spawning 1000 workers carries a lot of overhead and is a likely cause of your slow function. On top of that, all your workers appear to be doing exactly the same thing, which of course means you do the same work 1000 times.
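One way to size the pool and hand each worker a different input might look like the sketch below. The folder names and the count_chars function are hypothetical placeholders for the real per-folder work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_chars(name):
    """Hypothetical stand-in for the per-folder work."""
    return len(name)


folders = ["host_a", "host_bb", "host_ccc"]  # hypothetical inputs, one task each

# Cap the pool at the core count; os.cpu_count() may return None.
n_workers = min(len(folders), os.cpu_count() or 1)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    # map() submits every folder exactly once and yields results in input order.
    results = list(pool.map(count_chars, folders))
```

ProcessPoolExecutor exposes the same submit and map interface, so it can be swapped in if the work turns out to be CPU-bound rather than I/O-bound.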

My guess at the actual pivot problem

From what you write, aside from the code, I understand that you are trying to build a large key space that lets you slice any metric and drill down into the dataset.

From what I can see, you are doing this with a single column. You should split it into separate columns. As the commenters pointed out, pandas has categorical columns that could be used, but even without them the index for the key space will be much smaller if the key parts live in separate columns. Most likely your current dataset has a distinct key for almost every row, so the pivot aggregates no more than a few rows together and the pivot table ends up the same size as the original dataset.

TL;DR

Split your key column into multiple columns, preferably categorical.
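As a rough sketch of that suggestion: the data, the "/" delimiter, and the column names machine, event_id and section are all assumptions made for illustration, since the real key layout is only guessed at here.

```python
import pandas as pd

# Hypothetical data: a single composite key column plus a value to aggregate.
df = pd.DataFrame(
    {
        "key": [
            "hostA/4688/Subject",
            "hostA/4688/Process",
            "hostB/4688/Subject",
            "hostA/4688/Subject",
        ],
        "value": [1, 2, 3, 4],
    }
)

# Split the composite key into its parts, one column per part.
parts = df["key"].str.split("/", expand=True)
parts.columns = ["machine", "event_id", "section"]

# Store each part as a categorical: repeated strings become small integer codes.
wide = pd.concat([parts.astype("category"), df["value"]], axis=1)

# Pivot on the separate key columns; observed=True keeps only the
# category combinations that actually occur in the data.
summary = pd.pivot_table(
    wide,
    values="value",
    index=["machine", "event_id"],
    columns="section",
    aggfunc="sum",
    observed=True,
)
```

With the key parts separated, rows that share a machine and event ID collapse into one row of the result instead of each composite string forming its own group.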

I am iterating over many exported security event logs pulled from a Windows host. An example of the data is below:

    "MachineName","EventID","EntryType","Source","TimeGenerated","TimeWritten","UserName","Message"
    "mycompname","5156","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2017 10:47:41 AM","4/26/2017 10:47:41 AM",,"The Windows Filtering Platform has permitted a connection. Application Information: Process ID: 4 Application Name: System Network Information: Direction: %%14592 Source Address: 192.168.10.255 Source Port: 137 Destination Address: 192.168.10.238 Destination Port: 137 Protocol: 17 Filter Information: Filter Run-Time ID: 83695 Layer Name: %%14610 Layer Run-Time ID: 44"
    "mycompname","4688","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2014 10:47:03 AM","4/26/2014 10:47:03 AM",,"A new process has been created. Subject: Security ID: S-1-5-18 Account Name: mycompname$ Account Domain: mydomain Logon ID: 0x3e7 Process Information: New Process ID: 0x1b04 New Process Name: C:/Windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe Token Elevation Type: %%1936 Creator Process ID: 0x300 Process Command Line: C:/windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe Token Elevation Type indicates the type of token that was assigned to the new process in accordance with User Account Control policy. Type 1 is a full token with no privileges removed or groups disabled. A full token is only used if User Account Control is disabled or if the user is the built-in Administrator account or a service account. Type 2 is an elevated token with no privileges removed or groups disabled. An elevated token is used when User Account Control is enabled and the user chooses to start the program using Run as administrator. An elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege, and the user is a member of the Administrators group. Type 3 is a limited token with administrative privileges removed and administrative groups disabled. The limited token is used when User Account Control is enabled, the application does not require administrative privilege, and the user does not choose to start the program using Run as administrator."
    "mycompname","4673","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2014 10:47:00 AM","4/26/2014 10:47:00 AM",,"A privileged service was called. Subject: Security ID: S-1-5-18 Account Name: mycompname$ Account Domain: mydomain Logon ID: 0x3e7 Service: Server: NT Local Security Authority / Authentication Service Service Name: LsaRegisterLogonProcess() Process: Process ID: 0x308 Process Name: C:/Windows/System32/lsass.exe Service Request Information: Privileges: SeTcbPrivilege"

I am transforming it to extract the key: value pairs from the "Message" column and turn the keys into columns, as below:

    import glob2
    import numpy as np
    import pandas as pd

    def myfunc(folder):
        # Locate the exported security log inside the folder
        file = ''.join(glob2.glob(folder + "\\*security*"))
        df = pd.read_csv(file)
        # Turn the runs of spaces in the message into "," and " ||" delimiters
        df["Message"] = df["Message"].replace(["[ ]{6}", "[ ]{3}"], [",", " ||"], regex=True)
        # One row per "key: value" pair found between "|" delimiters
        message_results = df["Message"].str.extractall(r"\|([^\|]*?):(.*?)\|").reset_index()
        message_results.columns = ["org_index", "match", "keys", "vals"]
        # PART THAT TAKES THE LONGEST
        p = pd.pivot_table(message_results, values="vals", columns=["keys"], index=["org_index"], aggfunc=np.sum)
        df = df.join(p).fillna("NONE")

Output of the function above:

    MachineName,EventID,EntryType,Source,TimeGenerated,TimeWritten,UserName,Message, Application Information, Filter Information, Network Information, Process, Process Information, Service, Service Request Information, Subject,Account Domain,Account Name,Application Name,Creator Process ID,Destination Address,Destination Port,Direction,Filter Run-Time ID,Layer Name,Logon ID,New Process ID,New Process Name,Process Command Line,Process ID,Process Name,Protocol,Security ID,Server,Service Name,Source Address,Source Port,Token Elevation Type
    mycompname,5156,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:41 AM,4/26/2017 10:47:41 AM,NONE,The Windows Filtering Platform has permitted a connection. || Application Information: ||Process ID: 4 ||Application Name: System || Network Information: ||Direction: %%14592 ||Source Address: 192.168.10.255 ||Source Port: 137 ||Destination Address: 192.168.10.238 ||Destination Port: 137 ||Protocol: 17 || Filter Information: ||Filter Run-Time ID: 83695 ||Layer Name: %%14610 ||Layer Run-Time ID: 44, , , ,NONE,NONE,NONE,NONE,NONE,NONE,NONE, System ,NONE, 192.168.10.238 , 137 , %%14592 , 83695 , %%14610 ,NONE,NONE,NONE,NONE, 4 ,NONE, 17 ,NONE,NONE,NONE, 192.168.10.255 , 137 ,NONE
    mycompname,4688,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:03 AM,4/26/2017 10:47:03 AM,NONE,"A new process has been created. || Subject: ||Security ID: S-1-5-18 ||Account Name: mycompname$ ||Account Domain: mydomain ||Logon ID: 0x3e7 || Process Information: ||New Process ID: 0x1b04 ||New Process Name: C:/Windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe ||Token Elevation Type: %%1936 ||Creator Process ID: 0x300 ||Process Command Line: C:/windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe || Token Elevation Type indicates the type of token that was assigned to the new process in accordance with User Account Control policy. || Type 1 is a full token with no privileges removed or groups disabled. A full token is only used if User Account Control is disabled or if the user is the built-in Administrator account or a service account. || Type 2 is an elevated token with no privileges removed or groups disabled. An elevated token is used when User Account Control is enabled and the user chooses to start the program using Run as administrator. An elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege, and the user is a member of the Administrators group. || Type 3 is a limited token with administrative privileges removed and administrative groups disabled. The limited token is used when User Account Control is enabled, the application does not require administrative privilege, and the user does not choose to start the program using Run as administrator.",NONE,NONE,NONE,NONE, ,NONE,NONE, , mydomain , MEADWK4216DC190$ ,NONE, 0x300 ,NONE,NONE,NONE,NONE,NONE, 0x3e7 , 0x1b04 , C:/Windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe , C:/windows/SysWOW64/Macromed/Flash/FlashPlayerUpdateService.exe ,NONE,NONE,NONE, S-1-5-18 ,NONE,NONE,NONE,NONE, %%1936
    mycompname,4673,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:00 AM,4/26/2017 10:47:00 AM,NONE,A privileged service was called. || Subject: ||Security ID: S-1-5-18 ||Account Name: mycompname$ ||Account Domain: mydomain ||Logon ID: 0x3e7 || Service: ||Server: NT Local Security Authority / Authentication Service ||Service Name: LsaRegisterLogonProcess() || Process: ||Process ID: 0x308 ||Process Name: C:/Windows/System32/lsass.exe || Service Request Information: ||Privileges: SeTcbPrivilege,NONE,NONE,NONE, ,NONE, , , , mydomain , mycompname$ ,NONE,NONE,NONE,NONE,NONE,NONE,NONE, 0x3e7 ,NONE,NONE,NONE, 0x308 , C:/Windows/System32/lsass.exe ,NONE, S-1-5-18 , NT Local Security Authority / Authentication Service , LsaRegisterLogonProcess() ,NONE,NONE,NONE

The program works, but it is incredibly slow at the p = pivot_table part of the code on larger datasets (roughly 150000 lines).

I am currently using concurrent.futures.ThreadPoolExecutor(max_workers=1000), iterating over each file read as below:

    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as pool:
        for folder in path:
            if os.path.isdir(folder):
                try:
                    print(folder)
                    pool.submit(myfunc(folder), 1000)
                except:
                    print('error')

How can I speed up the pivot table part of my function?

Also, is there any way to speed up the pivot_table call itself in pandas?

Any help with this would be greatly appreciated. Thank you.