Judging by the error, TensorFlow ran out of memory while trying to allocate a tensor of shape [10000, 23000]. Since 10,000 is the number of examples in the MNIST test set, I'm going to guess that you have some evaluation code that tries to evaluate the whole test set at once. The activations alone would need 10000 * (784 + n + 10) float32 values, roughly 1 GB at n = 23000, which by itself shouldn't be enough to cause an OOM. But there is also a 1.7 GB tensor allocated, for reasons that are hard to explain.
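If the evaluation does feed all 10,000 test examples in one call, one way to cut peak memory is to evaluate in chunks and combine the per-chunk accuracies. A minimal sketch, assuming a hypothetical callback eval_batch(xs, ys) that returns the mean accuracy on one chunk (for instance, a thin wrapper around sess.run(accuracy, feed_dict=...)); batched_accuracy and the chunk size are illustrative, not part of the original code:

```python
def batched_accuracy(eval_batch, images, labels, batch_size=500):
    """Average accuracy over a dataset, evaluated in fixed-size chunks.

    eval_batch is a hypothetical callback: it receives one slice of images
    and labels and returns the mean accuracy on that slice.
    """
    total = len(images)
    if total == 0:
        raise ValueError("empty dataset")
    correct = 0.0
    for start in range(0, total, batch_size):
        xs = images[start:start + batch_size]
        ys = labels[start:start + batch_size]
        # Weight each chunk by its size so a short final chunk counts correctly.
        correct += eval_batch(xs, ys) * len(xs)
    return correct / total


# Stand-in for the real model: a "prediction" is right when input == label.
def fake_eval(xs, ys):
    return sum(1.0 for x, y in zip(xs, ys) if x == y) / len(xs)


images = [0, 1, 2, 3, 4, 5, 6]
labels = [0, 1, 2, 0, 0, 5, 6]
result = batched_accuracy(fake_eval, images, labels, batch_size=3)
print(result)
```

With chunks of 500 examples, the hidden-layer activation would have shape [500, n] instead of [10000, n], so its footprint shrinks by a factor of 20.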

For the laptop case, your calculation is missing some variables. Adam tracks the first and second moments for every variable, so the 2.2 GB triples to 6.6 GB. Add a bit of overhead for the gradients that will also be in memory, and that explains the OOM.
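The arithmetic behind that estimate can be sketched as follows. This assumes float32 storage and the 795n + 10 parameter count from the question; param_bytes and adam_bytes are hypothetical helper names, and gradients and activations are deliberately left out:

```python
BYTES_PER_FLOAT32 = 4


def param_bytes(n):
    """Bytes for all weights and biases: (795 * n + 10) float32 values."""
    return (795 * n + 10) * BYTES_PER_FLOAT32


def adam_bytes(n):
    """Variables plus Adam's first- and second-moment slots (3 copies)."""
    return 3 * param_bytes(n)


n = 700000
params_gb = param_bytes(n) / 1e9  # about 2.2 GB
adam_gb = adam_bytes(n) / 1e9     # about 6.7 GB, before gradients
print("parameters: %.2f GB, with Adam slots: %.2f GB" % (params_gb, adam_gb))
```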

Sorry that this doesn't fully answer your question. I would have added it as a comment, but I don't have the reputation for that yet.

I'm running TensorFlow version 0.7.1, 64-bit GPU-enabled, installed with pip, on a PC with Ubuntu 14.04. My problem is that TensorFlow runs out of memory when I build my network, even though by my calculations there should be enough room on my GPU.

Below is a minimal example of my code, which is based on the TensorFlow MNIST tutorial. The network is a two-layer fully connected network, and the number of nodes in the hidden layer is defined by the variable n. The training minibatch size is 1. Here is my code:

# Imports assumed; the original snippet omitted them.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist.input_data import read_data_sets

n = 23000  # number of hidden units

mnist = read_data_sets('MINST_Data', one_hot=True)
session = tf.InteractiveSession()

# Two-layer fully connected network: 784 -> n -> 10
x = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.truncated_normal([784, n], stddev=0.1))
b1 = tf.Variable(tf.constant(0.1, shape=[n]))
nn1 = tf.matmul(x, W1) + b1
W2 = tf.Variable(tf.truncated_normal([n, 10], stddev=0.1))
b2 = tf.Variable(tf.constant(0.1, shape=[10]))
nn2 = tf.matmul(nn1, W2) + b2
y = tf.nn.softmax(nn2)
y_ = tf.placeholder(tf.float32, [None, 10])

# Loss, optimizer and accuracy
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# Train with a minibatch of one example per step
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(1)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

Now, if n <= 22000, the network runs fine. However, if n >= 23000, I get the following error:

W tensorflow/core/common_runtime/gpu/] Ran out of memory trying to allocate 877.38MiB. See logs for memory state
W tensorflow/core/kernels/] Resource exhausted: OOM when allocating tensor with shape[10000,23000]
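As a quick sanity check, the 877.38MiB in the warning is what a single float32 tensor of the reported shape occupies. The sketch below assumes 4 bytes per element; tensor_mib is a hypothetical helper, not a TensorFlow function:

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / float(2 ** 20)


full_test_set = tensor_mib([10000, 23000])  # shape from the error message
single_example = tensor_mib([1, 23000])     # same layer with a batch of 1
print("%.2f MiB vs %.4f MiB" % (full_test_set, single_example))
```

With a batch of one example the same activation is under 0.1 MiB, which suggests the 10000-row tensor comes from somewhere other than the size-1 training batches.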

However, by my calculations, there shouldn't be a memory problem. The number of parameters in the network is as follows:

First layer weights: 784 * n
First layer biases: n
Second layer weights: 10 * n
Second layer biases: 10
Total: 795n + 10

So, with n = 23000 and float32 data, the total memory required for the network should be 73.1 MB.
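The 73.1 MB figure can be reproduced with a few lines. This is only a sketch of the count above, assuming 4 bytes per float32 value; layer_params is a hypothetical helper name:

```python
def layer_params(n, n_in=784, n_out=10):
    """Parameter counts per variable for the 784 -> n -> 10 network."""
    return {
        "W1": n_in * n,   # first layer weights
        "b1": n,          # first layer biases
        "W2": n * n_out,  # second layer weights
        "b2": n_out,      # second layer biases
    }


params = layer_params(23000)
total = sum(params.values())  # equals 795 * n + 10
total_mb = total * 4 / 1e6    # float32 uses 4 bytes per value
print("%d parameters, %.1f MB" % (total, total_mb))
```

W1 alone accounts for 784 * 23000 * 4 = 72,128,000 bytes, which matches the size of the 72128000-byte chunks listed in the allocator dump further down.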

Now, my graphics card is an NVIDIA GeForce GTX 780 Ti, which has 3072 MB of memory. After finding my graphics card, TensorFlow prints the following:

Total memory: 3.00GiB
Free memory: 2.32GiB

So there should be around 2.32 GB of memory available, far more than the 73.1 MB calculated above. The minibatch size is 1, so it has minimal effect. Why am I getting this error?

I also tried this on my laptop, which has an NVIDIA GeForce GTX 880M GPU. Here, TensorFlow reports Free memory: 7.60GiB. Running the same code as above, I get a memory error at around n = 700,000, which is equivalent to 2.2 GB. This makes more sense and is significantly higher than the point at which the code breaks on my PC. However, it still puzzles me why it doesn't get closer to the 7.6 GB mark.

The full TensorFlow output when running the code above on my PC, with n = 23000, is:

I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: GeForce GTX 780 Ti
major: 3 minor: 5 memoryClockRate (GHz) 1.0455
pciBusID 0000:01:00.0
Total memory: 3.00GiB
Free memory: 2.32GiB
I tensorflow/core/common_runtime/gpu/] DMA: 0
I tensorflow/core/common_runtime/gpu/] 0: Y
I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 780 Ti, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 64.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 128.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 256.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 512.00MiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 1.00GiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 2.00GiB
I tensorflow/core/common_runtime/gpu/] Creating bin of max chunk size 4.00GiB
I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 780 Ti, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/] Allocating 2.03GiB bytes.
I tensorflow/core/common_runtime/gpu/] GPU 0 memory begins at 0xb04720000 extends to 0xb86295000
I tensorflow/core/common_runtime/gpu/] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (524288): Total Chunks: 2, Chunks in use: 0 819.0KiB allocated for chunks. 390.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (134217728): Total Chunks: 1, Chunks in use: 0 68.79MiB allocated for chunks. 29.91MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (536870912): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (1073741824): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (2147483648): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin (4294967296): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/gpu/] Bin for 877.38MiB was 1.00GiB, Chunk State:
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d239400 of size 80128
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7600 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d24cd00 of size 438528
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7500 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb1a3e3200 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb1a302800 of size 920064
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15d58800 of size 920064
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb08cf7500 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04736b00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d2b7f00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15e39200 of size 72128000
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb08c16b00 of size 920064
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15c61500 of size 92160
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04736d00 of size 72128000
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d2b8100 of size 72128000
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15c4ad00 of size 92160
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04736a00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d2b7e00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7900 of size 400128
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04720200 of size 92160
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04736c00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb08cf7600 of size 72128000
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb1a3e3300 of size 1810570496
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1c0c00 of size 92160
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb08c00300 of size 92160
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d2b8000 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7800 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04720100 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7700 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb04720000 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb0d1d7400 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb11781700 of size 72128000
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15c77d00 of size 256
I tensorflow/core/common_runtime/gpu/] Chunk at 0xb15c77e00 of size 920064
I tensorflow/core/common_runtime/gpu/] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/gpu/] 16 Chunks of size 256 totalling 4.0KiB
I tensorflow/core/common_runtime/gpu/] 1 Chunks of size 80128 totalling 78.2KiB
I tensorflow/core/common_runtime/gpu/] 5 Chunks of size 92160 totalling 450.0KiB
I tensorflow/core/common_runtime/gpu/] 1 Chunks of size 400128 totalling 390.8KiB
I tensorflow/core/common_runtime/gpu/] 1 Chunks of size 438528 totalling 428.2KiB
I tensorflow/core/common_runtime/gpu/] 4 Chunks of size 920064 totalling 3.51MiB
I tensorflow/core/common_runtime/gpu/] 5 Chunks of size 72128000 totalling 343.93MiB
I tensorflow/core/common_runtime/gpu/] 1 Chunks of size 1810570496 totalling 1.69GiB
I tensorflow/core/common_runtime/gpu/] Sum Total of in-use chunks: 2.03GiB
W tensorflow/core/common_runtime/gpu/] Ran out of memory trying to allocate 877.38MiB. See logs for memory state
W tensorflow/core/kernels/] Resource exhausted: OOM when allocating tensor with shape[10000,23000]
W tensorflow/core/common_runtime/] 0x50f40e0 Compute status: Resource exhausted: OOM when allocating tensor with shape[10000,23000]
  [[Node: add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](MatMul, Variable_1/read)]]
W tensorflow/core/common_runtime/] 0x3234d30 Compute status: Resource exhausted: OOM when allocating tensor with shape[10000,23000]
  [[Node: add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](MatMul, Variable_1/read)]]
  [[Node: range_1/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_97_range_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
W tensorflow/core/common_runtime/] 0x3234d30 Compute status: Resource exhausted: OOM when allocating tensor with shape[10000,23000]
  [[Node: add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](MatMul, Variable_1/read)]]
  [[Node: Cast/_11 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_96_Cast", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "/home/jrowlay/Projects/Tensor_Flow_Tutorial/MNIST_CNN_Simple/", line 232, in <module>
    print(, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 315, in run
    return self._run(None, fetches, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 511, in _run
    feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 564, in _do_run
    target_list)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 586, in _do_call
    e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[10000,23000]
  [[Node: add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](MatMul, Variable_1/read)]]
  [[Node: range_1/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_97_range_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'add', defined at:
  File "/home/jrowlay/Projects/Tensor_Flow_Tutorial/MNIST_CNN_Simple/", line 215, in <module>
    nn1 = tf.matmul(x, W1) + b1
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/", line 468, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/", line 44, in add
    return _op_def_lib.apply_op("Add", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/", line 655, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/", line 2040, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/", line 1087, in __init__
    self._traceback = _extract_stack()

I ran into the same error. Simply restarting the program from the Jupyter notebook made it run successfully. I still haven't found the cause. Even running only the single line session = tf.InteractiveSession() produced the same error. Hope that helps.

Just FYI: I had the same error on my MacBook Pro. After closing some applications, the problem went away. But I still get other messages such as:

W tensorflow/core/common_runtime/ 217] Ran out of memory trying to allocate 214.51MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

So that is a genuine out-of-memory problem.