
Java: large files and disk I/O performance

I have two files (2 GB each) on my hard drive and want to compare them with each other:

Copying the original files with Windows Explorer takes approx. 2-4 minutes (that is, reading and writing, on the same physical and logical disk).
Reading both files with java.io.FileInputStream and comparing the byte arrays byte by byte takes more than 20 minutes.
The java.io.BufferedInputStream buffer is 64 KB; the files are read in chunks and then compared.
The comparison is done in a tight loop like

    // numRead[i] holds the number of bytes read into buffer[i]
    int len = Math.min(numRead[0], numRead[1]);
    for (int k = 0; k < len; k++) {
        if (buffer[1][k] != buffer[0][k]) {
            return buffer[0][k] - buffer[1][k];
        }
    }

What can I do to speed this up? Is NIO supposed to be faster than plain streams? Is Java unable to use DMA/SATA technologies, so that it falls back on slow OS API calls instead?

EDIT:
Thanks for the answers. I did some experiments based on them. As Andreas showed:

Streams and NIO approaches do not differ much.
More important is the correct buffer size.

This is confirmed by my own experiments. Because the files are read in big chunks, even additional buffering (BufferedInputStream) gains nothing. Optimizing the comparison is possible, and I got the best results with 32-fold unrolling, but the time spent comparing is small compared to the disk reads, so the speedup is small. It looks like there is nothing I can do ;-(

With files this large, you will get much better performance with java.nio.

Additionally, reading single bytes with Java streams can be very slow. Using a byte array (2-6K elements in my own experience, YMMV as it seems platform/application specific) will dramatically improve your read performance with streams.
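A minimal sketch of the byte-array approach described above, assuming Java 9 or later for InputStream.readNBytes and the ranged Arrays.equals. The class name ChunkedStreamCompare, the method sameContent, and the bufferSize parameter are illustrative and not from the original posts; the best buffer size has to be measured on the target machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: compares two files chunk by chunk using plain streams.
final class ChunkedStreamCompare {

    /** Returns true if both files have identical content. */
    static boolean sameContent(Path a, Path b, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        if (Files.size(a) != Files.size(b)) {
            return false; // files of different length cannot be equal
        }
        byte[] bufA = new byte[bufferSize];
        byte[] bufB = new byte[bufferSize];
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            while (true) {
                // readNBytes keeps reading until the buffer is full or EOF is reached,
                // so a short read never makes equal files look different.
                int readA = inA.readNBytes(bufA, 0, bufferSize);
                int readB = inB.readNBytes(bufB, 0, bufferSize);
                if (readA != readB) {
                    return false;
                }
                if (readA == 0) {
                    return true; // both streams are at EOF
                }
                // Compare only the bytes that were actually read.
                if (!Arrays.equals(bufA, 0, readA, bufB, 0, readB)) {
                    return false;
                }
            }
        }
    }
}
```

Calling ChunkedStreamCompare.sameContent(pathA, pathB, 64 * 1024) with a few different buffer sizes is a quick way to see how much the chunk size matters on a given disk.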

DMA/SATA are hardware-level, low-level technologies and are not visible to any programming language.

For memory-mapped input/output you should use java.nio, I believe.
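As a rough sketch of what memory-mapped comparison with java.nio could look like: the files are mapped window by window, because a single mapping cannot exceed Integer.MAX_VALUE bytes. The class name MappedFileCompare, the method sameContent, and the windowSize parameter are assumptions made for illustration; whether this beats plain reads on a given machine has to be benchmarked.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: compares two files through read-only memory mappings.
final class MappedFileCompare {

    /** Returns true if both files have identical content. */
    static boolean sameContent(Path a, Path b, long windowSize) throws IOException {
        if (windowSize <= 0 || windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("windowSize must be in 1..Integer.MAX_VALUE");
        }
        try (FileChannel chA = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel chB = FileChannel.open(b, StandardOpenOption.READ)) {
            long size = chA.size();
            if (size != chB.size()) {
                return false; // files of different length cannot be equal
            }
            for (long pos = 0; pos < size; pos += windowSize) {
                long len = Math.min(windowSize, size - pos);
                MappedByteBuffer mapA = chA.map(FileChannel.MapMode.READ_ONLY, pos, len);
                MappedByteBuffer mapB = chB.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // ByteBuffer.equals compares the remaining bytes of both buffers.
                if (!mapA.equals(mapB)) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

Note that mappings are released only when the buffers are garbage collected, so on some platforms the files may stay locked for a while after the call returns.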

Are you sure you are not reading those files one byte at a time? That would be wasteful. I would recommend reading block by block, with each block something like 64 megabytes, to minimize seeking.

I found that many of the articles linked in this post are really out of date (although there is also some very insightful material). Some of the linked articles are from 2001, and their information is questionable at best. Martin Thompson of Mechanical Sympathy wrote quite a bit about this in 2011. Please refer to what he wrote for the background and theory.

I have found that NIO versus non-NIO has very little to do with performance. It is much more about the size of your output buffers (or of the byte array you read into). NIO is not a magic make-it-go-fast, web-scale sauce.

I was able to take Martin's examples, use the 1.0-era OutputStream, and make it scream. NIO is fast too, but the biggest factor is simply the size of the output buffer, not whether you use NIO, unless of course you are using memory-mapped NIO, in which case it does matter. :)

If you want up-to-date information on this, see Martin's blog:

http://mechanical-sympathy.blogspot.com/2011/12/java-sequential-io-performance.html

If you want to see how NIO does not make that much of a difference (I was able to write examples using regular IO that were faster), see this:

http://www.dzone.com/links/fast_java_io_nio_is_always_faster_than_fileoutput.html

I have tested my assumption on a new Windows laptop with a fast hard disk, my MacBook Pro with an SSD, an EC2 xlarge, and an EC2 4x large with maxed-out IOPS/high-speed I/O (and soon on a large-disk NAS fibre array), so it works. (There are some issues with it on smaller EC2 instances, but if you care about performance, are you going to use a small EC2 instance?) If you use real hardware, in my tests so far, traditional IO always wins. If you use high-I/O EC2 instances, traditional IO is also a clear winner. If you use underpowered EC2 instances, NIO can win.

There is no substitute for benchmarking.

Anyway, I am no expert; I just did some empirical testing using the framework that Sir Martin Thompson wrote up in his blog post.

I took this to the next step and used Files.newInputStream (from JDK 7) with TransferQueue to create a recipe for making Java I/O scream (even on small EC2 instances). The recipe can be found at the bottom of this documentation for Boon ( https://github.com/RichardHightower/boon/wiki/Auto-Growable-Byte-Buffer-like-a-ByteBuilder ). This lets me use a traditional OutputStream, but with something that works well on smaller EC2 instances. (I am the main author of Boon, but I am accepting new authors. The pay sucks: $0 per hour. The good news is that I can double your pay whenever you like.)

My 2 cents.

See this for why TransferQueue is important: http://php.sabscape.com/blog/?p=557

Key learnings:

If you care about performance, never, ever, ever use BufferedOutputStream.
NIO does not always equal performance.
Buffer size matters most.
Recycling buffers for high-speed writes is critical.
GC can/will/does implode your performance for high-speed writes.
You have to have some mechanism for reusing spent buffers.
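To illustrate the last three points, here is a minimal sketch of one way to reuse spent buffers instead of allocating a new one per write. It is not the Boon recipe mentioned above: it swaps in a plain ArrayBlockingQueue rather than a TransferQueue, and the names BufferPool, acquire, release, and available are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical fixed-size pool: buffers are allocated once and then recycled,
// so steady-state I/O creates no garbage for the collector to clean up.
final class BufferPool {

    private final BlockingQueue<ByteBuffer> free;

    BufferPool(int count, int bufferSize) {
        free = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    /** Blocks until a buffer is free, then hands it out reset and ready to fill. */
    ByteBuffer acquire() throws InterruptedException {
        ByteBuffer buffer = free.take();
        buffer.clear();
        return buffer;
    }

    /** Returns a spent buffer to the pool; extra buffers beyond capacity are dropped. */
    void release(ByteBuffer buffer) {
        free.offer(buffer);
    }

    /** Number of buffers currently waiting in the pool. */
    int available() {
        return free.size();
    }
}
```

A producer thread would call acquire(), fill the buffer, and hand it to a writer thread, which calls release() once the bytes are on disk; the blocking acquire() also throttles the producer when the writer falls behind.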

After modifying your NIO comparison function I get the following results.

    I was equal, even after 4294967296 bytes and reading for 304594 ms (13.45MB/sec * 2) with a buffer size of 1024 kB
    I was equal, even after 4294967296 bytes and reading for 225078 ms (18.20MB/sec * 2) with a buffer size of 4096 kB
    I was equal, even after 4294967296 bytes and reading for 221351 ms (18.50MB/sec * 2) with a buffer size of 16384 kB

Note: this means the files are being read at a combined rate of 37 MB/s.

Running the same thing on a faster disk:

    I was equal, even after 4294967296 bytes and reading for 178087 ms (23.00MB/sec * 2) with a buffer size of 1024 kB
    I was equal, even after 4294967296 bytes and reading for 119084 ms (34.40MB/sec * 2) with a buffer size of 4096 kB
    I was equal, even after 4294967296 bytes and reading for 109549 ms (37.39MB/sec * 2) with a buffer size of 16384 kB

Note: this means the files are being read at a combined rate of 74.8 MB/s.
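The rates in the two notes follow directly from the byte count and elapsed time in each log line: 4294967296 bytes is 4096 MB per file, and two files are read in the same time span. A small sketch of that arithmetic, assuming 1 MB = 1024 * 1024 bytes; the class and method names are made up for illustration.

```java
// Hypothetical helper that reproduces the throughput figures from the log lines.
final class Throughput {

    /** Rate for a single file in MB/s, with 1 MB taken as 1024 * 1024 bytes. */
    static double perFile(long bytes, long millis) {
        double megabytes = bytes / (1024.0 * 1024.0);
        double seconds = millis / 1000.0;
        return megabytes / seconds;
    }

    /** Combined rate when fileCount files of that size are read in the same time. */
    static double combined(long bytes, long millis, int fileCount) {
        return perFile(bytes, millis) * fileCount;
    }
}
```

For the 16384 kB runs, perFile(4294967296L, 221351L) gives roughly 18.5 MB/s, so about 37 MB/s for both files, and perFile(4294967296L, 109549L) gives roughly 37.4 MB/s, so about 74.8 MB/s for both.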

    private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
        if (first.limit() != second.limit() || length > first.limit()) {
            return false;
        }
        first.rewind();
        second.rewind();
        int i;
        // Compare 8 bytes at a time while at least 8 bytes remain.
        for (i = 0; i < length - 7; i += 8) {
            if (first.getLong() != second.getLong()) {
                return false;
            }
        }
        // Compare the remaining 0-7 bytes one by one.
        for (; i < length; i++) {
            if (first.get() != second.get()) {
                return false;
            }
        }
        return true;
    }

The following is a good article on the relative merits of the different ways to read a file in Java. It may be of some use:

How to read files quickly

Try setting the buffer on the input stream to several megabytes.

Reading and writing the files with Java can be just as fast. You can use FileChannels. As for comparing the files, a byte-by-byte comparison will obviously take a lot of time. Here is an example using FileChannels and ByteBuffers (it could be optimized further):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public static boolean compare(String firstPath, String secondPath, final int BUFFER_SIZE) throws IOException {
        FileChannel firstIn = null, secondIn = null;
        try {
            firstIn = new FileInputStream(firstPath).getChannel();
            secondIn = new FileInputStream(secondPath).getChannel();
            // Files of different size can never be equal.
            if (firstIn.size() != secondIn.size())
                return false;
            ByteBuffer firstBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
            ByteBuffer secondBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
            int firstRead, secondRead;
            while (firstIn.position() < firstIn.size()) {
                firstRead = firstIn.read(firstBuffer);
                secondRead = secondIn.read(secondBuffer);
                if (firstRead != secondRead)
                    return false;
                if (!buffersEqual(firstBuffer, secondBuffer, firstRead))
                    return false;
            }
            return true;
        } finally {
            // Close each channel that was opened.
            if (firstIn != null) firstIn.close();
            if (secondIn != null) secondIn.close();
        }
    }

    private static boolean buffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
        if (first.limit() != second.limit())
            return false;
        if (length > first.limit())
            return false;
        // Rewinding resets the position to 0, which also prepares both buffers for the next read.
        first.rewind();
        second.rewind();
        for (int i = 0; i < length; i++)
            if (first.get() != second.get())
                return false;
        return true;
    }

For a fairer comparison, try copying two files at once. A hard drive can read one file much more efficiently than two at the same time, because the head has to move back and forth between them. One way to reduce this is to use larger buffers with ByteBuffer, for example 16 MB.

With ByteBuffer you can compare 8 bytes at a time by comparing long values with getLong().

If your Java code is efficient, most of the work is done by the disk/OS for reading and writing, so it should not be much slower than any other language (the disk/OS is the bottleneck).

Do not assume Java is slow until you have determined that it is not a bug in your code.

I tried three different methods of comparing two identical 3.8 GB files, with buffer sizes between 8 kB and 1 MB. The first method used just two buffered input streams.

The second approach uses a thread pool that reads in two different threads and compares in a third one. This got slightly higher throughput at the expense of high CPU utilization. Managing the thread pool adds a lot of overhead for such short-running tasks.

The third approach uses NIO, as posted by laginimaineb.

As you can see, the general approach does not make much of a difference. More important is the correct buffer size.

What is strange is that I read 1 byte less using threads. I could not spot the error, though.

    comparing just with two streams
    I was equal, even after 3684070360 bytes and reading for 704813 ms (4,98MB/sec * 2) with a buffer size of 8 kB
    I was equal, even after 3684070360 bytes and reading for 578563 ms (6,07MB/sec * 2) with a buffer size of 16 kB
    I was equal, even after 3684070360 bytes and reading for 515422 ms (6,82MB/sec * 2) with a buffer size of 32 kB
    I was equal, even after 3684070360 bytes and reading for 534532 ms (6,57MB/sec * 2) with a buffer size of 64 kB
    I was equal, even after 3684070360 bytes and reading for 422953 ms (8,31MB/sec * 2) with a buffer size of 128 kB
    I was equal, even after 3684070360 bytes and reading for 793359 ms (4,43MB/sec * 2) with a buffer size of 256 kB
    I was equal, even after 3684070360 bytes and reading for 746344 ms (4,71MB/sec * 2) with a buffer size of 512 kB
    I was equal, even after 3684070360 bytes and reading for 669969 ms (5,24MB/sec * 2) with a buffer size of 1024 kB

    comparing with threads
    I was equal, even after 3684070359 bytes and reading for 602391 ms (5,83MB/sec * 2) with a buffer size of 8 kB
    I was equal, even after 3684070359 bytes and reading for 523156 ms (6,72MB/sec * 2) with a buffer size of 16 kB
    I was equal, even after 3684070359 bytes and reading for 527547 ms (6,66MB/sec * 2) with a buffer size of 32 kB
    I was equal, even after 3684070359 bytes and reading for 276750 ms (12,69MB/sec * 2) with a buffer size of 64 kB
    I was equal, even after 3684070359 bytes and reading for 493172 ms (7,12MB/sec * 2) with a buffer size of 128 kB
    I was equal, even after 3684070359 bytes and reading for 696781 ms (5,04MB/sec * 2) with a buffer size of 256 kB
    I was equal, even after 3684070359 bytes and reading for 727953 ms (4,83MB/sec * 2) with a buffer size of 512 kB
    I was equal, even after 3684070359 bytes and reading for 741000 ms (4,74MB/sec * 2) with a buffer size of 1024 kB

    comparing with nio
    I was equal, even after 3684070360 bytes and reading for 661313 ms (5,31MB/sec * 2) with a buffer size of 8 kB
    I was equal, even after 3684070360 bytes and reading for 656156 ms (5,35MB/sec * 2) with a buffer size of 16 kB
    I was equal, even after 3684070360 bytes and reading for 491781 ms (7,14MB/sec * 2) with a buffer size of 32 kB
    I was equal, even after 3684070360 bytes and reading for 317360 ms (11,07MB/sec * 2) with a buffer size of 64 kB
    I was equal, even after 3684070360 bytes and reading for 643078 ms (5,46MB/sec * 2) with a buffer size of 128 kB
    I was equal, even after 3684070360 bytes and reading for 865016 ms (4,06MB/sec * 2) with a buffer size of 256 kB
    I was equal, even after 3684070360 bytes and reading for 716796 ms (4,90MB/sec * 2) with a buffer size of 512 kB
    I was equal, even after 3684070360 bytes and reading for 652016 ms (5,39MB/sec * 2) with a buffer size of 1024 kB

The code used:

    import junit.framework.Assert;
    import org.junit.Before;
    import org.junit.Test;

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.text.DecimalFormat;
    import java.text.NumberFormat;
    import java.util.Arrays;
    import java.util.concurrent.*;

    public class FileCompare {

        private static final int MIN_BUFFER_SIZE = 1024 * 8;
        private static final int MAX_BUFFER_SIZE = 1024 * 1024;

        private String fileName1;
        private String fileName2;
        private long start;
        private long totalbytes;

        @Before
        public void createInputStream() {
            fileName1 = "bigFile.1";
            fileName2 = "bigFile.2";
        }

        // Method 1: two buffered input streams, buffer size doubled on each run.
        @Test
        public void compareTwoFiles() throws IOException {
            System.out.println("comparing just with two streams");
            int currentBufferSize = MIN_BUFFER_SIZE;
            while (currentBufferSize <= MAX_BUFFER_SIZE) {
                compareWithBufferSize(currentBufferSize);
                currentBufferSize *= 2;
            }
        }

        // Method 2: reads and comparisons run as tasks on a thread pool.
        @Test
        public void compareTwoFilesFutures() throws IOException, ExecutionException, InterruptedException {
            System.out.println("comparing with threads");
            int myBufferSize = MIN_BUFFER_SIZE;
            while (myBufferSize <= MAX_BUFFER_SIZE) {
                start = System.currentTimeMillis();
                totalbytes = 0;
                compareWithBufferSizeFutures(myBufferSize);
                myBufferSize *= 2;
            }
        }

        // Method 3: FileChannels with direct ByteBuffers.
        @Test
        public void compareTwoFilesNio() throws IOException {
            System.out.println("comparing with nio");
            int myBufferSize = MIN_BUFFER_SIZE;
            while (myBufferSize <= MAX_BUFFER_SIZE) {
                start = System.currentTimeMillis();
                totalbytes = 0;
                boolean wasEqual = isEqualsNio(myBufferSize);
                if (wasEqual) {
                    printAfterEquals(myBufferSize);
                } else {
                    Assert.fail("files were not equal");
                }
                myBufferSize *= 2;
            }
        }

        private void compareWithBufferSize(int myBufferSize) throws IOException {
            final BufferedInputStream inputStream1 = new BufferedInputStream(
                    new FileInputStream(new File(fileName1)), myBufferSize);
            byte[] buff1 = new byte[myBufferSize];
            final BufferedInputStream inputStream2 = new BufferedInputStream(
                    new FileInputStream(new File(fileName2)), myBufferSize);
            byte[] buff2 = new byte[myBufferSize];
            int read1;
            start = System.currentTimeMillis();
            totalbytes = 0;
            while ((read1 = inputStream1.read(buff1)) != -1) {
                totalbytes += read1;
                int read2 = inputStream2.read(buff2);
                if (read1 != read2) {
                    break;
                }
                if (!Arrays.equals(buff1, buff2)) {
                    break;
                }
            }
            // read1 is -1 only if the loop ran to the end of the file without a mismatch.
            if (read1 == -1) {
                printAfterEquals(myBufferSize);
            } else {
                Assert.fail("files were not equal");
            }
            inputStream1.close();
            inputStream2.close();
        }

        private void compareWithBufferSizeFutures(int myBufferSize)
                throws ExecutionException, InterruptedException, IOException {
            final BufferedInputStream inputStream1 = new BufferedInputStream(
                    new FileInputStream(new File(fileName1)), myBufferSize);
            final BufferedInputStream inputStream2 = new BufferedInputStream(
                    new FileInputStream(new File(fileName2)), myBufferSize);
            final boolean wasEqual = isEqualsParallel(myBufferSize, inputStream1, inputStream2);
            if (wasEqual) {
                printAfterEquals(myBufferSize);
            } else {
                Assert.fail("files were not equal");
            }
            inputStream1.close();
            inputStream2.close();
        }

        private boolean isEqualsParallel(int myBufferSize,
                                         final BufferedInputStream inputStream1,
                                         final BufferedInputStream inputStream2)
                throws InterruptedException, ExecutionException {
            // Two buffer pairs ("even" and "odd"): one pair is compared while the other is being filled.
            final byte[] buff1Even = new byte[myBufferSize];
            final byte[] buff1Odd = new byte[myBufferSize];
            final byte[] buff2Even = new byte[myBufferSize];
            final byte[] buff2Odd = new byte[myBufferSize];
            final Callable<Integer> read1Even = new Callable<Integer>() {
                public Integer call() throws Exception {
                    return inputStream1.read(buff1Even);
                }
            };
            final Callable<Integer> read2Even = new Callable<Integer>() {
                public Integer call() throws Exception {
                    return inputStream2.read(buff2Even);
                }
            };
            final Callable<Integer> read1Odd = new Callable<Integer>() {
                public Integer call() throws Exception {
                    return inputStream1.read(buff1Odd);
                }
            };
            final Callable<Integer> read2Odd = new Callable<Integer>() {
                public Integer call() throws Exception {
                    return inputStream2.read(buff2Odd);
                }
            };
            final Callable<Boolean> oddEqualsArray = new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    return Arrays.equals(buff1Odd, buff2Odd);
                }
            };
            final Callable<Boolean> evenEqualsArray = new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    return Arrays.equals(buff1Even, buff2Even);
                }
            };
            ExecutorService executor = Executors.newCachedThreadPool();
            try {
                boolean isEven = true;
                Future<Integer> read1 = null;
                Future<Integer> read2 = null;
                Future<Boolean> isEqual = null;
                int lastSize = 0;
                while (true) {
                    if (isEqual != null) {
                        if (!isEqual.get()) {
                            return false;
                        } else if (lastSize == -1) {
                            return true;
                        }
                    }
                    if (read1 != null) {
                        lastSize = read1.get();
                        // At end of file read() returns -1, and that -1 is also added here,
                        // so this variant reports one byte less than the other two.
                        totalbytes += lastSize;
                        final int size2 = read2.get();
                        if (lastSize != size2) {
                            return false;
                        }
                    }
                    isEven = !isEven;
                    if (isEven) {
                        if (read1 != null) {
                            isEqual = executor.submit(oddEqualsArray);
                        }
                        read1 = executor.submit(read1Even);
                        read2 = executor.submit(read2Even);
                    } else {
                        if (read1 != null) {
                            isEqual = executor.submit(evenEqualsArray);
                        }
                        read1 = executor.submit(read1Odd);
                        read2 = executor.submit(read2Odd);
                    }
                }
            } finally {
                // Let the pool's worker threads terminate once the comparison is done.
                executor.shutdown();
            }
        }

        private boolean isEqualsNio(int myBufferSize) throws IOException {
            FileChannel first = null, seconde = null;
            try {
                first = new FileInputStream(fileName1).getChannel();
                seconde = new FileInputStream(fileName2).getChannel();
                if (first.size() != seconde.size()) {
                    return false;
                }
                ByteBuffer firstBuffer = ByteBuffer.allocateDirect(myBufferSize);
                ByteBuffer secondBuffer = ByteBuffer.allocateDirect(myBufferSize);
                int firstRead, secondRead;
                while (first.position() < first.size()) {
                    firstRead = first.read(firstBuffer);
                    totalbytes += firstRead;
                    secondRead = seconde.read(secondBuffer);
                    if (firstRead != secondRead) {
                        return false;
                    }
                    if (!nioBuffersEqual(firstBuffer, secondBuffer, firstRead)) {
                        return false;
                    }
                }
                return true;
            } finally {
                if (first != null) {
                    first.close();
                }
                if (seconde != null) {
                    seconde.close();
                }
            }
        }

        private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
            if (first.limit() != second.limit() || length > first.limit()) {
                return false;
            }
            first.rewind();
            second.rewind();
            for (int i = 0; i < length; i++) {
                if (first.get() != second.get()) {
                    return false;
                }
            }
            return true;
        }

        private void printAfterEquals(int myBufferSize) {
            NumberFormat nf = new DecimalFormat("#.00");
            final long dur = System.currentTimeMillis() - start;
            double seconds = dur / 1000d;
            // Integer division: the megabyte count is truncated to a whole number.
            double megabytes = totalbytes / 1024 / 1024;
            double rate = (megabytes) / seconds;
            System.out.println("I was equal, even after " + totalbytes
                    + " bytes and reading for " + dur
                    + " ms (" + nf.format(rate) + "MB/sec * 2)"
                    + " with a buffer size of " + myBufferSize / 1024 + " kB");
        }
    }

You can have a look at Sun's article on I/O tuning (although it is already a bit dated); maybe you can find similarities between the examples there and your code. Also have a look at the java.nio package, which contains faster I/O elements than java.io. Dr. Dobb's Journal has a quite nice article on high-performance IO using java.nio.

If so, there are further examples and tuning tips available there that should help you speed up your code.

Moreover, the Arrays class has built-in methods for comparing byte arrays; maybe these can also be used to make things faster and to clean up your loop a bit.
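As a sketch of what that could look like, assuming Java 9 or later: the ranged Arrays.mismatch and Arrays.compare overloads work on just the bytes that were actually read, so they can stand in for the hand-written loop in the question. The class and method names below are illustrative only.

```java
import java.util.Arrays;

// Hypothetical wrappers around the ranged java.util.Arrays methods (Java 9+).
final class ArrayCompareSketch {

    /** Index of the first differing byte within the first len bytes, or -1 if they all match. */
    static int firstDifference(byte[] a, byte[] b, int len) {
        return Arrays.mismatch(a, 0, len, b, 0, len);
    }

    /**
     * Negative, zero, or positive, like the loop in the question:
     * the result reflects the first pair of bytes that differ.
     */
    static int compareChunks(byte[] a, byte[] b, int len) {
        return Arrays.compare(a, 0, len, b, 0, len);
    }
}
```

Here len plays the role of the smaller of the two read counts in the question's loop, and both arrays must hold at least len bytes.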