Python 3: Can pickle handle bytes objects larger than 4 GB?


Here is the full workaround, although it seems pickle.load no longer tries to read a huge file in a single call (I am on Python 3.5.2), so strictly speaking only pickle.dump needs the wrapper to work properly.

    import pickle


    class MacOSFile(object):
        """File wrapper that splits large reads and writes into chunks below 2 GiB."""

        def __init__(self, f):
            self.f = f

        def __getattr__(self, item):
            # Delegate every other attribute to the wrapped file object.
            return getattr(self.f, item)

        def read(self, n):
            # print("reading total_bytes=%s" % n, flush=True)
            if n >= (1 << 31):
                buffer = bytearray(n)
                idx = 0
                while idx < n:
                    # Parentheses matter: 1 << 31 - 1 would be parsed as 1 << 30.
                    batch_size = min(n - idx, (1 << 31) - 1)
                    # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                    buffer[idx:idx + batch_size] = self.f.read(batch_size)
                    # print("done.", flush=True)
                    idx += batch_size
                return buffer
            return self.f.read(n)

        def write(self, buffer):
            n = len(buffer)
            print("writing total_bytes=%s..." % n, flush=True)
            idx = 0
            while idx < n:
                batch_size = min(n - idx, (1 << 31) - 1)
                print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
                self.f.write(buffer[idx:idx + batch_size])
                print("done.", flush=True)
                idx += batch_size


    def pickle_dump(obj, file_path):
        with open(file_path, "wb") as f:
            return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


    def pickle_load(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(MacOSFile(f))
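A side note on the chunk-size expression in wrappers like this one: in Python, binary `-` binds more tightly than `<<`, so `1 << 31 - 1` means `1 << 30`, not `2**31 - 1`. Either value stays below the 2 GiB limit, so both avoid the error, but parentheses are needed to get the larger chunk. A quick check (the variable names are only for illustration):

```python
# Operator precedence check: the subtraction is evaluated before the shift.
unparenthesized = 1 << 31 - 1    # parsed as 1 << (31 - 1)
parenthesized = (1 << 31) - 1    # the intended 2**31 - 1

print(unparenthesized == 1 << 30)    # True
print(parenthesized == 2**31 - 1)    # True
print(parenthesized < 2**31)         # True
```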

Based on this comment and the documentation it references, pickle protocol 4.0+ in Python 3.4+ should be able to pickle bytes objects larger than 4 GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large byte array:

    >>> import pickle
    >>> x = bytearray(8 * 1000 * 1000 * 1000)
    >>> fp = open("x.dat", "wb")
    >>> pickle.dump(x, fp, protocol = 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OSError: [Errno 22] Invalid argument

Is there a bug in my code, or am I misunderstanding the documentation?

Here is a simple workaround for issue 24658. Use pickle.loads or pickle.dumps and break the bytes object into chunks of size 2**31 - 1 to get it into or out of the file.

    import pickle
    import os.path

    file_path = "pkl.pkl"
    n_bytes = 2**31
    max_bytes = 2**31 - 1
    data = bytearray(n_bytes)

    ## write
    # For objects over 4 GB, pass protocol=4 on Python versions where it is not the default.
    bytes_out = pickle.dumps(data)
    with open(file_path, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])

    ## read
    bytes_in = bytearray(0)
    input_size = os.path.getsize(file_path)
    with open(file_path, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    data2 = pickle.loads(bytes_in)

    assert(data == data2)
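The same idea can be wrapped in two small helpers. This is only a minimal sketch: the names `dump_chunked`, `load_chunked`, and the `max_bytes` parameter are hypothetical and not part of the pickle module. The demonstration at the end uses a tiny `max_bytes` so the chunking path is exercised without allocating gigabytes.

```python
import os
import pickle
import tempfile

MAX_BYTES = 2**31 - 1  # largest single read/write size used by the workaround


def dump_chunked(obj, file_path, max_bytes=MAX_BYTES):
    """Pickle obj to file_path, writing at most max_bytes per write call."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    view = memoryview(payload)  # slicing a memoryview does not copy the data
    with open(file_path, "wb") as f_out:
        for idx in range(0, len(view), max_bytes):
            f_out.write(view[idx:idx + max_bytes])


def load_chunked(file_path, max_bytes=MAX_BYTES):
    """Read file_path in pieces of at most max_bytes and unpickle the result."""
    chunks = []
    with open(file_path, "rb") as f_in:
        while True:
            chunk = f_in.read(max_bytes)
            if not chunk:  # empty bytes means end of file
                break
            chunks.append(chunk)
    return pickle.loads(b"".join(chunks))


# Small demonstration: a tiny max_bytes forces many chunks.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    original = bytearray(b"abc" * 1000)
    dump_chunked(original, demo_path, max_bytes=64)
    restored = load_chunked(demo_path, max_bytes=64)

print(restored == original)
```

Joining the chunks at the end still holds the data twice for a moment, so this sketch favors simplicity over peak memory use.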

Reading a file in 2 GB chunks takes twice as much memory as needed if the chunks are joined by bytes concatenation. My approach to loading pickles is based on a preallocated bytearray:

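    class MacOSFile(object):

        def __init__(self, f):
            self.f = f

        def __getattr__(self, item):
            # Delegate every other attribute to the wrapped file object.
            return getattr(self.f, item)

        def read(self, n):
            if n >= (1 << 31):
                # Preallocate the full buffer once and fill it chunk by chunk.
                buffer = bytearray(n)
                pos = 0
                while pos < n:
                    # Parentheses matter: 1 << 31 - 1 would be parsed as 1 << 30.
                    size = min(n - pos, (1 << 31) - 1)
                    chunk = self.f.read(size)
                    buffer[pos:pos + size] = chunk
                    pos += size
                return buffer
            return self.f.read(n)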

Usage:

    import pickle

    with open("/path", "rb") as fin:
        obj = pickle.load(MacOSFile(fin))
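A variation on the same preallocation idea is to fill the buffer in place with `readinto` and a `memoryview`, which also copes with reads that return fewer bytes than requested. This swaps in a different technique from the answer above and is only a sketch; `read_exact` and `chunk_size` are hypothetical names.

```python
import io


def read_exact(f, n, chunk_size=(1 << 31) - 1):
    """Read up to n bytes from binary file f into one preallocated bytearray."""
    buffer = bytearray(n)
    pos = 0
    with memoryview(buffer) as view:
        while pos < n:
            end = min(pos + chunk_size, n)
            got = f.readinto(view[pos:end])  # fills the slice in place, no copy
            if not got:  # end of file reached before n bytes were read
                break
            pos += got
    if pos < n:
        del buffer[pos:]  # trim the unused tail when the file was shorter
    return buffer


# Small demonstration with an in-memory file and a tiny chunk size.
source = io.BytesIO(b"0123456789" * 10)
data = read_exact(source, 100, chunk_size=7)
print(len(data))
```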

To sum up what was answered in the comments:

Yes, Python can pickle bytes objects larger than 4 GB. The observed error is caused by a bug in the implementation (see bugs.python.org/issue24658).

I ran into this issue too. To work around it, I split the work into several iterations. In my case I have 50,000 documents for which I need to compute tf-idf and run kNN classification. When I iterate over all 50,000 in one go and save the result, I get that error. So I process and save the data in chunks instead.

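    # Fragment of a method; `documents` is assumed to be defined earlier.
    tokenized_documents = self.load_tokenized_preprocessing_documents()
    idf = self.load_idf_41227()
    doc_length = len(documents)
    for iteration in range(0, 9):
        tfidf_documents = []
        # NOTE: as written, every pass covers range(iteration, 4000), so the
        # batches overlap; disjoint batches would need bounds scaled by the batch size.
        for index in range(iteration, 4000):
            doc_tfidf = []
            for term in idf.keys():
                tf = self.term_frequency(term, tokenized_documents[index])
                doc_tfidf.append(tf * idf[term])
            doc = documents[index]
            tfidf = [doc_tfidf, doc[0], doc[1]]
            tfidf_documents.append(tfidf)
            print("{} from {} document {}".format(index, doc_length, doc[0]))
        # Save each batch separately so no single pickle grows too large.
        self.save_tfidf_41227(tfidf_documents, iteration)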
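The batching idea can be sketched independently of the tf-idf code. The helpers below are hypothetical (`iter_batches`, `save_in_batches`, and `load_batches` are not from the answer above); they split a list into disjoint slices and pickle each slice to its own file, so that no single pickle has to hold everything.

```python
import os
import pickle
import tempfile


def iter_batches(items, batch_size):
    """Yield consecutive, non-overlapping slices of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def save_in_batches(items, directory, batch_size):
    """Pickle each batch to its own file and return the file paths in order."""
    paths = []
    for number, batch in enumerate(iter_batches(items, batch_size)):
        path = os.path.join(directory, "batch_{}.pkl".format(number))
        with open(path, "wb") as f_out:
            pickle.dump(batch, f_out, protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def load_batches(paths):
    """Load the batches back and concatenate them into one list."""
    items = []
    for path in paths:
        with open(path, "rb") as f_in:
            items.extend(pickle.load(f_in))
    return items


# Small demonstration: 10 items in batches of 4 gives 3 files.
with tempfile.TemporaryDirectory() as tmp_dir:
    records = [[i, "doc-{}".format(i)] for i in range(10)]
    batch_paths = save_in_batches(records, tmp_dir, batch_size=4)
    reloaded = load_batches(batch_paths)

print(len(batch_paths), reloaded == records)
```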