
machine learning - TensorFlow: introducing both L2 regularization and dropout into the network. Does it make any sense?




I am currently playing with ANNs as part of the Udacity DeepLearning course.

I have successfully built and trained a network and introduced L2 regularization on all weights and biases. Right now I am trying out dropout on the hidden layer in order to improve generalization. I wonder, does it make sense to introduce both L2 regularization on the hidden layer and dropout on that same layer? If so, how to do it properly?

During dropout we literally switch off half of the activations of the hidden layer and double the amount output by the rest of the neurons. While using L2 we compute the L2 norm over all the hidden weights. But I am not sure how to compute L2 in case we use dropout. We switch off some of the activations; shouldn't we remove the weights which are not used right now from the L2 calculation? Any references on that matter would be useful; I haven't found any information.
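(As far as I can tell, TensorFlow's tf.nn.dropout implements "inverted" dropout: the kept activations are scaled by 1/keep_prob at training time, so the doubling happens inside the op itself. A minimal sketch, assuming a TF 1.x-style API, to check that behavior:

    import tensorflow as tf

    # tf.nn.dropout zeroes each activation with probability 1 - keep_prob and
    # scales the survivors by 1/keep_prob, keeping the expected value unchanged
    x = tf.ones([1, 4])
    dropped = tf.nn.dropout(x, keep_prob=0.5)

    with tf.Session() as sess:
        print(sess.run(dropped))  # e.g. [[2. 0. 2. 2.]] - which entries are zeroed varies per run

The usual practice I have seen is to keep the L2 penalty over all the weights regardless of which units were dropped in a given minibatch; in expectation every weight is still active.)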

In case you are interested, my code for the ANN with L2 regularization is below:

#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization. Default is 0.01
#but feel free to play with it
#notice, we introduce L2 for both biases and weights of all layers

beta = 0.01

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    #now let's build our new hidden layer
    #that's how many hidden neurons we want
    num_hidden_neurons = 1024

    #its weights
    hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
    hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

    #now the layer itself. It multiplies data by weights, adds biases
    #and takes ReLU over result
    hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

    #time to go for output linear layer
    #out weights connect hidden neurons to output labels
    #biases are added to output labels
    out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))
    out_biases = tf.Variable(tf.zeros([num_labels]))

    #compute output
    out_layer = tf.matmul(hidden_layer, out_weights) + out_biases

    #our real output is a softmax of prior result
    #and we also compute its cross-entropy to get our loss
    #Notice - we introduce our L2 here
    loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        out_layer, tf_train_labels) +
        beta*tf.nn.l2_loss(hidden_weights) +
        beta*tf.nn.l2_loss(hidden_biases) +
        beta*tf.nn.l2_loss(out_weights) +
        beta*tf.nn.l2_loss(out_biases)))

    #now we just minimize this loss to actually train the network
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    #nice, now let's calculate the predictions on each dataset for evaluating the
    #performance so far
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(out_layer)
    valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
    valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)
    test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
    test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Actually, the original paper uses max-norm regularization, not L2, in addition to dropout: "The neural network was optimized under the constraint ||w||₂ ≤ c. This constraint was imposed during optimization by projecting w onto the surface of a ball of radius c, whenever w went out of it. This is also called max-norm regularization since it implies that the maximum value that the norm of any weight can take is c" (http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
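A minimal sketch of how such a max-norm constraint could be wired into the question's graph-style code (the radius c is an arbitrary value, and tf.clip_by_norm with the axes argument is assumed to be available in the TF version used):

    # hypothetical max-norm projection for the question's hidden_weights:
    # clip each hidden unit's incoming weight vector (a column of the matrix)
    # back to norm c whenever it exceeds c
    c = 3.0  # max-norm radius - a hyperparameter to tune, not a value from the paper
    max_norm_op = hidden_weights.assign(
        tf.clip_by_norm(hidden_weights, clip_norm=c, axes=[0]))

    # in the training loop, right after each optimizer step:
    #   session.run(max_norm_op)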

You can find a good discussion of this regularization method here: https://plus.google.com/+IanGoodfellow/posts/QUaCJfvDpni


There is no downside to using multiple regularizations. In fact, there is a paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting (the same JMLR paper linked above), where the authors checked how much it helps. Clearly, different datasets will give different results, but for your MNIST the paper includes a comparison of regularization methods:

there you can see that Dropout + Max-norm gives the lowest error. Apart from this, you have a big mistake in your code.

You use l2_loss on both weights and biases:

beta*tf.nn.l2_loss(hidden_weights) +
beta*tf.nn.l2_loss(hidden_biases) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)

You should not penalize large biases. A bias only shifts a unit's activation and does not multiply the inputs, so regularizing it does little against overfitting and mostly makes the fit worse. Therefore, remove the l2_loss terms on the biases.
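Applied to the loss in the question, a sketch of the corrected version (the penalty terms are also moved outside the reduce_mean; that is equivalent, since a scalar added to the per-example cross-entropy vector only shifts its mean, but it reads more clearly):

    # L2 penalty over the weight matrices only; biases stay unregularized
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(out_layer, tf_train_labels))
    loss += beta * (tf.nn.l2_loss(hidden_weights) + tf.nn.l2_loss(out_weights))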


OK, after some additional effort I managed to solve it and introduce both L2 and dropout into my network; the code is below. I got a slight improvement over the same network without dropout (with L2 in place). I am still not sure whether it is really worth the effort to introduce both of them, L2 and dropout, but at least it works and slightly improves the results.

#ANN with introduced dropout
#This time we still use the L2 but restrict training dataset
#to be extremely small

#get just first 500 of examples, so that our ANN can memorize whole dataset
train_dataset_2 = train_dataset[:500, :]
train_labels_2 = train_labels[:500]

#batch size for SGD and beta parameter for L2 loss
batch_size = 128
beta = 0.001

#that's how many hidden neurons we want
num_hidden_neurons = 1024

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    #now let's build our new hidden layer
    #its weights
    hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
    hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

    #now the layer itself. It multiplies data by weights, adds biases
    #and takes ReLU over result
    hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

    #add dropout on hidden layer
    #we pick up the probability of switching off the activation
    #and perform the switch off of the activations
    keep_prob = tf.placeholder("float")
    hidden_layer_drop = tf.nn.dropout(hidden_layer, keep_prob)

    #time to go for output linear layer
    #out weights connect hidden neurons to output labels
    #biases are added to output labels
    out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))
    out_biases = tf.Variable(tf.zeros([num_labels]))

    #compute output
    #notice that upon training we use the switched-off activations
    #i.e. the variation of hidden_layer with the dropout active
    out_layer = tf.matmul(hidden_layer_drop, out_weights) + out_biases

    #our real output is a softmax of prior result
    #and we also compute its cross-entropy to get our loss
    #Notice - we introduce our L2 here
    loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        out_layer, tf_train_labels) +
        beta*tf.nn.l2_loss(hidden_weights) +
        beta*tf.nn.l2_loss(hidden_biases) +
        beta*tf.nn.l2_loss(out_weights) +
        beta*tf.nn.l2_loss(out_biases)))

    #now we just minimize this loss to actually train the network
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    #nice, now let's calculate the predictions on each dataset for evaluating the
    #performance so far
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(out_layer)
    valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
    valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)
    test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
    test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels_2.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset_2[offset:(offset + batch_size), :]
        batch_labels = train_labels_2[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels,
                     keep_prob : 0.5}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
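One detail worth spelling out (my observation, not part of the course material): the validation and test predictions above are rebuilt from hidden_weights without the dropout node, so they are automatically dropout-free, and since tf.nn.dropout already rescales the kept activations during training, no extra correction is needed at test time. Had the evaluation path gone through hidden_layer_drop instead, the idiom would be to feed keep_prob = 1.0, roughly:

    # hypothetical: if valid_prediction were built on top of hidden_layer_drop,
    # dropout is switched off at evaluation by feeding keep_prob = 1.0
    # (every unit is kept and no rescaling is applied)
    print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(feed_dict={keep_prob : 1.0}), valid_labels))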