repetir - sql eliminar registros duplicados menos uno

Conteo ordenado de repeticiones/duplicados consecutivos (1)

Caso de prueba

Primero, una forma más útil de presentar sus datos, o mejor aún, en un sqlfiddle , listo para jugar:

CREATE TEMP TABLE data( system_measured int , time_of_measurement int , measurement int ); INSERT INTO data VALUES (1, 1, 5) ,(1, 2, 150) ,(1, 3, 5) ,(1, 4, 5) ,(2, 1, 5) ,(2, 2, 5) ,(2, 3, 5) ,(2, 4, 5) ,(2, 5, 150) ,(2, 6, 5) ,(2, 7, 5) ,(2, 8, 5);

Consulta simplificada

Dado que no está claro, estoy asumiendo solo lo anterior como dado.
A continuación, simplifiqué tu consulta para llegar a:

WITH x AS ( SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured ORDER BY time_of_measurement) = measurement THEN 0 ELSE 1 END AS step FROM data ) , y AS ( SELECT *, sum(step) OVER(PARTITION BY system_measured ORDER BY time_of_measurement) AS grp FROM x ) SELECT * ,row_number() OVER (PARTITION BY system_measured, grp ORDER BY time_of_measurement) - 1 AS repeat_ct FROM y ORDER BY system_measured, time_of_measurement;

Ahora bien, si bien es agradable y brillante usar SQL puro, esto será mucho más rápido con una función plpgsql, porque puede hacerlo en un escaneo de tabla única donde esta consulta necesita al menos tres escaneos.

Más rápido con la función plpgsql:

CREATE OR REPLACE FUNCTION x.f_repeat_ct() RETURNS TABLE ( system_measured int , time_of_measurement int , measurement int, repeat_ct int ) LANGUAGE plpgsql AS $func$ DECLARE r data; -- table name serves as record type r0 data; BEGIN -- SET LOCAL work_mem = ''1000 MB''; -- uncomment an adapt if needed, see below! repeat_ct := 0; -- init FOR r IN SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement LOOP IF r.system_measured = r0.system_measured AND r.measurement = r0.measurement THEN repeat_ct := repeat_ct + 1; -- start new array ELSE repeat_ct := 0; -- start new count END IF; RETURN QUERY SELECT r.*, repeat_ct; r0 := r; -- remember last row END LOOP; END $func$;

Llamada:

SELECT * FROM x.f_repeat_ct();

Asegúrese de calificar en tabla los nombres de sus columnas en todo momento en este tipo de función plpgsl, ya que usamos los mismos nombres que los parámetros de salida que tendrían prioridad si no están calificados.

Billones de filas

Si tiene miles de millones de filas , es posible que desee dividir esta operación. Cito el manual aquí :

Nota: La implementación actual de RETURN NEXT y RETURN QUERY almacena todo el conjunto de resultados antes de regresar de la función, como se explicó anteriormente. Eso significa que si una función PL / pgSQL produce un conjunto de resultados muy grande, el rendimiento puede ser pobre: los datos se escribirán en el disco para evitar el agotamiento de la memoria, pero la función no regresará hasta que se haya generado todo el conjunto de resultados. Una versión futura de PL / pgSQL podría permitir a los usuarios definir funciones de devolución de conjuntos que no tienen esta limitación. Actualmente, el punto en el que los datos comienzan a escribirse en el disco está controlado por la variable de configuración work_mem . Los administradores que tienen suficiente memoria para almacenar conjuntos de resultados más grandes en la memoria deberían considerar aumentar este parámetro.

Considere calcular filas para un sistema a la vez o establezca un valor lo suficientemente alto para work_mem para hacer frente a la carga. Siga el enlace proporcionado en la cita en más sobre work_mem.

Una forma sería establecer un valor muy alto para work_mem con SET LOCAL en su función, que solo es efectivo para la transacción actual. Agregué una línea comentada en la función. No lo configure muy alto a nivel mundial, ya que esto podría dañar su servidor. Lee el manual.

Dudo mucho que esté haciendo esto de la manera más eficiente, y es por eso que plpgsql aquí. Necesito ejecutar esto en 2 mil millones de filas para mil sistemas de medición .

Tiene sistemas de medición que a menudo informan el valor anterior cuando pierden la conectividad, y pierden la conectividad a menudo, pero a veces durante mucho tiempo. Debe agregar, pero cuando lo hace, debe ver cuánto tiempo se repite y hacer varios filtros basados en esa información. Digamos que está midiendo mpg en un automóvil, pero está atascado a 20 mpg durante una hora que a 20.1 y así sucesivamente. Querrá evaluar la precisión cuando esté atascado. También puede colocar algunas reglas alternativas que se deben buscar cuando el automóvil está en la carretera, y con las funciones de ventana puede generar el "estado" del automóvil y tener algo para agrupar. Sin más preámbulos:

--here''s my data, you have different systems, the time of measurement, and the actual measurement --as well, the raw data has whether or not it''s a repeat (hense the included window function select * into temporary table cumulative_repeat_calculator_data FROM ( select system_measured, time_of_measurement, measurement, case when measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) then 1 else 0 end as repeat FROM ( SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured UNION SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured UNION SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured UNION SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured UNION SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured UNION SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured UNION SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured ) as data ) as data; --unfortunately you can''t have window functions within window functions, so I had to break it down into subquery --what we need is something to partion on, the ''state'' of the system if you will, so I ran a running total of the nonrepeats --this creates a row that stays the same when your data is repeating - aka something you can partition/group on select * into temporary table cumulative_repeat_calculator_step_1 FROM ( select *, sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system from cumulative_repeat_calculator_data order by system_measured, time_of_measurement ) as data; --finally, the query. I didn''t bother showing my desired output, because this (finally) got it --I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat --what you can do now is take the average measurement under some condition based on how long it was repeating, for example select *, case when repeat = 0 then 0 else row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1 end as ordered_repeat from cumulative_repeat_calculator_step_1 order by system_measured, time_of_measurement

Entonces, ¿qué harías de manera diferente para ejecutar esto en una mesa enorme, o qué herramientas alternativas usarías? Estoy pensando en plpgsql porque sospecho que esto debe hacerse en la base de datos, o durante el proceso de inserción de datos, aunque generalmente trabajo con los datos después de que se carguen. ¿Hay alguna forma de obtener esto en un solo barrido sin recurrir a subconsultas?

He probado un método alternativo , pero todavía depende de una subconsulta y creo que es más rápido. Para ese método, crea una tabla de "arranques y paradas" con start_timestamp, end_timestamp, system. Luego te unes a la tabla más grande y si la marca de tiempo está entre esas, la clasificas como en ese estado, que es esencialmente una alternativa a cumlative_sum_of_nonrepeats_by_system . Pero cuando haces esto, te unes a 1 = 1 para miles de dispositivos y miles o millones de ''eventos''. ¿Crees que es una mejor forma de hacerlo?