python - unir - Fusionar listas que comparten elementos comunes

llenar una lista en python (10)

Mi entrada es una lista de listas. Algunos de ellos comparten elementos comunes, ej.

L = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']]

Necesito fusionar todas las listas, que comparten un elemento común, y repetir este procedimiento siempre que no haya más listas con el mismo elemento. Pensé en usar operaciones booleanas y un ciclo while, pero no pude encontrar una buena solución.

El resultado final debería ser:

L = [[''a'',''b'',''c'',''d'',''e'',''f'',''g'',''o'',''p''],[''k'']]

Algoritmo:

tomar el primer conjunto A de la lista
para cada otro conjunto B en la lista hacer si B tiene elemento (s) común (s) con A unir B en A; eliminar B de la lista
repita 2. hasta que no haya más superposición con A
poner A en outpup
repita 1. con el resto de la lista

Por lo tanto, es posible que desee utilizar conjuntos en lugar de listas. El siguiente programa debería hacerlo.

l = [[''a'', ''b'', ''c''], [''b'', ''d'', ''e''], [''k''], [''o'', ''p''], [''e'', ''f''], [''p'', ''a''], [''d'', ''g'']] out = [] while len(l)>0: first, *rest = l first = set(first) lf = -1 while len(first)>lf: lf = len(first) rest2 = [] for r in rest: if len(first.intersection(set(r)))>0: first |= set(r) else: rest2.append(r) rest = rest2 out.append(first) l = rest print(out)

Como Jochen Ritzel señaló, está buscando componentes conectados en un gráfico. Aquí es cómo podría implementarlo sin utilizar una biblioteca de gráficos:

from collections import defaultdict def connected_components(lists): neighbors = defaultdict(set) seen = set() for each in lists: for item in each: neighbors[item].update(each) def component(node, neighbors=neighbors, seen=seen, see=seen.add): nodes = set([node]) next_node = nodes.pop while nodes: node = next_node() see(node) nodes |= neighbors[node] - seen yield node for node in neighbors: if node not in seen: yield sorted(component(node)) L = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']] print list(connected_components(L))

Creo que esto se puede resolver modelando el problema como un graph . Cada sublista es un nodo y comparte un borde con otro nodo solo si las dos sublistas tienen algún elemento en común. Por lo tanto, una sublista combinada es básicamente un componente conectado en el gráfico. Combinar todos ellos es simplemente una cuestión de encontrar todos los componentes conectados y enumerarlos.

Esto se puede hacer mediante un recorrido simple sobre el gráfico. Se pueden usar tanto BFS como DFS , pero estoy usando DFS aquí ya que es algo más corto para mí.

l = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']] taken=[False]*len(l) l=map(set,l) def dfs(node,index): taken[index]=True ret=node for i,item in enumerate(l): if not taken[i] and not ret.isdisjoint(item): ret.update(dfs(item,i)) return ret def merge_all(): ret=[] for i,node in enumerate(l): if not taken[i]: ret.append(list(dfs(node,i))) return ret print merge_all()

Este es quizás un algoritmo más simple / más rápido y parece funcionar bien -

l = [[''a'', ''b'', ''c''], [''b'', ''d'', ''e''], [''k''], [''o'', ''p''], [''e'', ''f''], [''p'', ''a''], [''d'', ''g'']] len_l = len(l) i = 0 while i < (len_l - 1): for j in range(i + 1, len_l): # i,j iterate over all pairs of l''s elements including new # elements from merged pairs. We use len_l because len(l) # may change as we iterate i_set = set(l[i]) j_set = set(l[j]) if len(i_set.intersection(j_set)) > 0: # Remove these two from list l.pop(j) l.pop(i) # Merge them and append to the orig. list ij_union = list(i_set.union(j_set)) l.append(ij_union) # len(l) has changed len_l -= 1 # adjust ''i'' because elements shifted i -= 1 # abort inner loop, continue with next l[i] break i += 1 print l # prints [[''k''], [''a'', ''c'', ''b'', ''e'', ''d'', ''g'', ''f'', ''o'', ''p'']]

He encontrado itertools una opción rápida para fusionar listas y me solucionó este problema:

import itertools LL = set(itertools.chain.from_iterable(L)) # LL is {''a'', ''b'', ''c'', ''d'', ''e'', ''f'', ''g'', ''k'', ''o'', ''p''} for each in LL: components = [x for x in L if each in x] for i in components: L.remove(i) L += [list(set(itertools.chain.from_iterable(components)))] # then L = [[''k''], [''a'', ''c'', ''b'', ''e'', ''d'', ''g'', ''f'', ''o'', ''p'']]

Para grandes conjuntos, ordenar LL por frecuencia, desde los elementos más comunes hasta los menos, puede acelerar un poco las cosas.

Me encontré con el mismo problema de tratar de combinar listas con valores comunes. Este ejemplo puede ser lo que estás buscando. Solo se repite una vez sobre las listas y actualiza el conjunto de resultados a medida que avanza.

lists = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']] lists = sorted([sorted(x) for x in lists]) #Sorts lists in place so you dont miss things. Trust me, needs to be done. resultslist = [] #Create the empty result list. if len(lists) >= 1: # If your list is empty then you dont need to do anything. resultlist = [lists[0]] #Add the first item to your resultset if len(lists) > 1: #If there is only one list in your list then you dont need to do anything. for l in lists[1:]: #Loop through lists starting at list 1 listset = set(l) #Turn you list into a set merged = False #Trigger for index in range(len(resultlist)): #Use indexes of the list for speed. rset = set(resultlist[index]) #Get list from you resultset as a set if len(listset & rset) != 0: #If listset and rset have a common value then the len will be greater than 1 resultlist[index] = list(listset | rset) #Update the resultlist with the updated union of listset and rset merged = True #Turn trigger to True break #Because you found a match there is no need to continue the for loop. if not merged: #If there was no match then add the list to the resultset, so it doesnt get left out. resultlist.append(l) print resultlist

#

resultset = [[''a'', ''b'', ''c'', ''d'', ''e'', ''g'', ''f'', ''o'', ''p''], [''k'']]

Mi intento. Tiene un aspecto funcional.

#!/usr/bin/python from collections import defaultdict l = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']] hashdict = defaultdict(int) def hashit(x, y): for i in y: x[i] += 1 return x def merge(x, y): sums = sum([hashdict[i] for i in y]) if sums > len(y): x[0] = x[0].union(y) else: x[1] = x[1].union(y) return x hashdict = reduce(hashit, l, hashdict) sets = reduce(merge, l, [set(),set()]) print [list(sets[0]), list(sets[1])]

Necesitaba realizar la técnica de agrupamiento descrita por el OP millones de veces para listas bastante grandes y, por lo tanto, quería determinar cuál de los métodos sugeridos anteriormente es el más preciso y el más eficaz.

Ejecuté 10 pruebas para listas de entrada con un tamaño de 2 ^ 1 a 2 ^ 10 para cada método anterior, usando la misma lista de entrada para cada método, y medí el tiempo de ejecución promedio para cada algoritmo propuesto anteriormente en milisegundos. Aquí están los resultados:

Estos resultados me ayudaron a ver que de los métodos que siempre arrojan resultados correctos, @ jochen es el más rápido. Entre los métodos que no siempre arrojan resultados correctos, la solución de mak a menudo no incluye todos los elementos de entrada (es decir, falta la lista de miembros de la lista) y no se garantiza que las soluciones de braaksma, cmangla y asterisk se fusionen al máximo .

Es interesante que los dos algoritmos más rápidos y correctos tengan la mayor cantidad de votos ascendentes hasta la fecha, en el orden correcto.

Aquí está el código usado para ejecutar las pruebas:

from networkx.algorithms.components.connected import connected_components from itertools import chain from random import randint, random from collections import defaultdict, deque from copy import deepcopy from multiprocessing import Pool import networkx import datetime import os ## # @mimomu ## def mimomu(l): l = deepcopy(l) s = set(chain.from_iterable(l)) for i in s: components = [x for x in l if i in x] for j in components: l.remove(j) l += [list(set(chain.from_iterable(components)))] return l ## # @Howard ## def howard(l): out = [] while len(l)>0: first, *rest = l first = set(first) lf = -1 while len(first)>lf: lf = len(first) rest2 = [] for r in rest: if len(first.intersection(set(r)))>0: first |= set(r) else: rest2.append(r) rest = rest2 out.append(first) l = rest return out ## # Nx @Jochen Ritzel ## def jochen(l): l = deepcopy(l) def to_graph(l): G = networkx.Graph() for part in l: # each sublist is a bunch of nodes G.add_nodes_from(part) # it also imlies a number of edges: G.add_edges_from(to_edges(part)) return G def to_edges(l): """ treat `l` as a Graph and returns it''s edges to_edges([''a'',''b'',''c'',''d'']) -> [(a,b), (b,c),(c,d)] """ it = iter(l) last = next(it) for current in it: yield last, current last = current G = to_graph(l) return list(connected_components(G)) ## # Merge all @MAK ## def mak(l): l = deepcopy(l) taken=[False]*len(l) l=map(set,l) def dfs(node,index): taken[index]=True ret=node for i,item in enumerate(l): if not taken[i] and not ret.isdisjoint(item): ret.update(dfs(item,i)) return ret def merge_all(): ret=[] for i,node in enumerate(l): if not taken[i]: ret.append(list(dfs(node,i))) return ret result = list(merge_all()) return result ## # @cmangla ## def cmangla(l): l = deepcopy(l) len_l = len(l) i = 0 while i < (len_l - 1): for j in range(i + 1, len_l): # i,j iterate over all pairs of l''s elements including new # elements from merged pairs. We use len_l because len(l) # may change as we iterate i_set = set(l[i]) j_set = set(l[j]) if len(i_set.intersection(j_set)) > 0: # Remove these two from list l.pop(j) l.pop(i) # Merge them and append to the orig. list ij_union = list(i_set.union(j_set)) l.append(ij_union) # len(l) has changed len_l -= 1 # adjust ''i'' because elements shifted i -= 1 # abort inner loop, continue with next l[i] break i += 1 return l ## # @pillmuncher ## def pillmuncher(l): l = deepcopy(l) def connected_components(lists): neighbors = defaultdict(set) seen = set() for each in lists: for item in each: neighbors[item].update(each) def component(node, neighbors=neighbors, seen=seen, see=seen.add): nodes = set([node]) next_node = nodes.pop while nodes: node = next_node() see(node) nodes |= neighbors[node] - seen yield node for node in neighbors: if node not in seen: yield sorted(component(node)) return list(connected_components(l)) ## # @NicholasBraaksma ## def braaksma(l): l = deepcopy(l) lists = sorted([sorted(x) for x in l]) #Sorts lists in place so you dont miss things. Trust me, needs to be done. resultslist = [] #Create the empty result list. if len(lists) >= 1: # If your list is empty then you dont need to do anything. resultlist = [lists[0]] #Add the first item to your resultset if len(lists) > 1: #If there is only one list in your list then you dont need to do anything. for l in lists[1:]: #Loop through lists starting at list 1 listset = set(l) #Turn you list into a set merged = False #Trigger for index in range(len(resultlist)): #Use indexes of the list for speed. rset = set(resultlist[index]) #Get list from you resultset as a set if len(listset & rset) != 0: #If listset and rset have a common value then the len will be greater than 1 resultlist[index] = list(listset | rset) #Update the resultlist with the updated union of listset and rset merged = True #Turn trigger to True break #Because you found a match there is no need to continue the for loop. if not merged: #If there was no match then add the list to the resultset, so it doesnt get left out. resultlist.append(l) return resultlist ## # @Rumple Stiltskin ## def stiltskin(l): l = deepcopy(l) hashdict = defaultdict(int) def hashit(x, y): for i in y: x[i] += 1 return x def merge(x, y): sums = sum([hashdict[i] for i in y]) if sums > len(y): x[0] = x[0].union(y) else: x[1] = x[1].union(y) return x hashdict = reduce(hashit, l, hashdict) sets = reduce(merge, l, [set(),set()]) return list(sets) ## # @Asterisk ## def asterisk(l): l = deepcopy(l) results = {} for sm in [''min'', ''max'']: sort_method = min if sm == ''min'' else max l = sorted(l, key=lambda x:sort_method(x)) queue = deque(l) grouped = [] while len(queue) >= 2: l1 = queue.popleft() l2 = queue.popleft() s1 = set(l1) s2 = set(l2) if s1 & s2: queue.appendleft(s1 | s2) else: grouped.append(s1) queue.appendleft(s2) if queue: grouped.append(queue.pop()) results[sm] = grouped if len(results[''min'']) < len(results[''max'']): return results[''min''] return results[''max''] ## # Validate no more clusters can be merged ## def validate(output, L): # validate all sublists are maximally merged d = defaultdict(list) for idx, i in enumerate(output): for j in i: d[j].append(i) if any([len(i) > 1 for i in d.values()]): return ''not maximally merged'' # validate all items in L are accounted for all_items = set(chain.from_iterable(L)) accounted_items = set(chain.from_iterable(output)) if all_items != accounted_items: return ''missing items'' # validate results are good return ''true'' ## # Timers ## def time(func, L): start = datetime.datetime.now() result = func(L) delta = datetime.datetime.now() - start return result, delta ## # Function runner ## def run_func(args): func, L, input_size = args results, elapsed = time(func, L) validation_result = validate(results, L) return func.__name__, input_size, elapsed, validation_result ## # Main ## all_results = defaultdict(lambda: defaultdict(list)) funcs = [mimomu, howard, jochen, mak, cmangla, braaksma, asterisk] args = [] for trial in range(10): for s in range(10): input_size = 2**s # get some random inputs to use for all trials at this size L = [] for i in range(input_size): sublist = [] for j in range(randint(5, 10)): sublist.append(randint(0, 2**24)) L.append(sublist) for i in funcs: args.append([i, L, input_size]) pool = Pool() for result in pool.imap(run_func, args): func_name, input_size, elapsed, validation_result = result all_results[func_name][input_size].append({ ''time'': elapsed, ''validation'': validation_result, }) # show the running time for the function at this input size print(input_size, func_name, elapsed, validation_result) pool.close() pool.join() # write the average of time trials at each size for each function with open(''times.tsv'', ''w'') as out: for func in all_results: validations = [i[''validation''] for j in all_results[func] for i in all_results[func][j]] linetype = ''incorrect results'' if any([i != ''true'' for i in validations]) else ''correct results'' for input_size in all_results[func]: all_times = [i[''time''].microseconds for i in all_results[func][input_size]] avg_time = sum(all_times) / len(all_times) out.write(func + ''/t'' + str(input_size) + ''/t'' + / str(avg_time) + ''/t'' + linetype + ''/n'')

Y para trazar:

library(ggplot2) df <- read.table(''times.tsv'', sep=''/t'') p <- ggplot(df, aes(x=V2, y=V3, color=as.factor(V1))) + geom_line() + xlab(''number of input lists'') + ylab(''runtime (ms)'') + labs(color='''') + scale_x_continuous(trans=''log10'') + facet_wrap(~V4, ncol=1) ggsave(''runtimes.png'')

Puede ver su lista como una notación para un gráfico, es decir, [''a'',''b'',''c''] es un gráfico con 3 nodos conectados entre sí. El problema que está tratando de resolver es encontrar componentes conectados en este gráfico .

Puede usar NetworkX para esto, que tiene la ventaja de que está garantizado que es correcto:

l = [[''a'',''b'',''c''],[''b'',''d'',''e''],[''k''],[''o'',''p''],[''e'',''f''],[''p'',''a''],[''d'',''g'']] import networkx from networkx.algorithms.components.connected import connected_components def to_graph(l): G = networkx.Graph() for part in l: # each sublist is a bunch of nodes G.add_nodes_from(part) # it also imlies a number of edges: G.add_edges_from(to_edges(part)) return G def to_edges(l): """ treat `l` as a Graph and returns it''s edges to_edges([''a'',''b'',''c'',''d'']) -> [(a,b), (b,c),(c,d)] """ it = iter(l) last = next(it) for current in it: yield last, current last = current G = to_graph(l) print connected_components(G) # prints [[''a'', ''c'', ''b'', ''e'', ''d'', ''g'', ''f'', ''o'', ''p''], [''k'']]

Para resolverlo usted mismo de manera eficiente, tiene que convertir la lista en algo gráfico de todos modos, por lo que también podría usar networkX desde el principio.

Sin saber muy bien lo que quieres, decidí adivinar lo que querías decir: quiero encontrar cada elemento solo una vez.

#!/usr/bin/python def clink(l, acc): for sub in l: if sub.__class__ == list: clink(sub, acc) else: acc[sub]=1 def clunk(l): acc = {} clink(l, acc) print acc.keys() l = [[''a'', ''b'', ''c''], [''b'', ''d'', ''e''], [''k''], [''o'', ''p''], [''e'', ''f''], [''p'', ''a''], [''d'', ''g'']] clunk(l)

La salida se ve así:

[''a'', ''c'', ''b'', ''e'', ''d'', ''g'', ''f'', ''k'', ''o'', ''p'']