


Cannot read a file from HDFS using Spark

I have installed Cloudera CDH 5 using Cloudera Manager.

I can easily do

hadoop fs -ls /input/war-and-peace.txt
hadoop fs -cat /input/war-and-peace.txt

and the cat command above prints the whole text file to the console.

Now I start the Spark shell and run

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
textFile.count

Now I get an error.

Spark context available as sc.

scala> val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
2014-12-14 15:14:57,874 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177621) called with curMem=0, maxMem=278302556
2014-12-14 15:14:57,877 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 173.5 KB, free 265.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = hdfs://input/war-and-peace.txt MappedRDD[1] at textFile at <console>:12

scala> textFile.count
2014-12-14 15:15:21,791 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 0 time(s); maxRetries=45
2014-12-14 15:15:41,905 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 1 time(s); maxRetries=45
...
2014-12-14 15:24:23,250 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 27 time(s); maxRetries=45
java.net.ConnectException: Call From dn1home/192.168.1.21 to input:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
    at org.apache.hadoop.ipc.Client.call(Client.java:1415)

Why do I get this error? I can read the same file just fine with the hadoop commands.


Somehow I solved it myself shortly after posting the question.

Here is the solution:

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")

So how did I find out about nn1home:8020?

Just look in the file core-site.xml and check the value of the XML element fs.defaultFS.
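
For reference, the relevant entry in core-site.xml looks roughly like this (the hostname and port here are the ones from this answer; yours will differ):

<property>
  <!-- the default filesystem URI that scheme-less paths resolve against -->
  <name>fs.defaultFS</name>
  <value>hdfs://nn1home:8020</value>
</property>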


This worked for me:

logFile = "hdfs://localhost:9000/sampledata/sample.txt"
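
A minimal sketch of how that path is used, assuming a NameNode listening on localhost:9000 and a Spark shell with the usual sc already available:

val logFile = "hdfs://localhost:9000/sampledata/sample.txt"
val logData = sc.textFile(logFile)   // RDD[String], one element per line
println(logData.count())             // forces the read; fails fast if the URL is wrong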


You are not passing a proper URL string.

  • hdfs:// - the protocol type
  • localhost - the IP address (may be different for you, e.g. 127.56.78.4)
  • 54310 - the port number
  • /input/war-and-peace.txt - the complete path to the file you want to load

The final URL should look like this:

hdfs://localhost:54310/input/war-and-peace.txt
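
To sanity-check the assembled URL from the Spark shell (assuming your NameNode really does listen on localhost:54310):

val textFile = sc.textFile("hdfs://localhost:54310/input/war-and-peace.txt")
textFile.count()   // returns a line count instead of the retry loop in the question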


It may also be a problem with the file path or URL and the HDFS port.

Solution: First open the file core-site.xml from the location $HADOOP_HOME/etc/hadoop and check the value of the property fs.defaultFS. Suppose the value is hdfs://localhost:9000 and the file location in HDFS is /home/usr/abc/fileName.txt. Then the file URL is hdfs://localhost:9000/home/usr/abc/fileName.txt, and the following command reads the file from HDFS:

val result = scontext.textFile("hdfs://localhost:9000/home/usr/abc/fileName.txt", 2)
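
The second argument (2 here) is minPartitions, the minimum number of partitions for the resulting RDD; it is optional.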


If you started Spark with HADOOP_HOME set in spark-env.sh, Spark knows where to look for the HDFS configuration files.

In that case Spark already knows the location of your namenode/datanode, and just the following should work fine to access HDFS files:

sc.textFile("/myhdfsdirectory/myfiletoprocess.txt")
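
For reference, a minimal sketch of the spark-env.sh lines this relies on (the install path is an assumption; point it at your own Hadoop distribution):

export HADOOP_HOME=/usr/lib/hadoop
# HADOOP_CONF_DIR is the directory scanned for core-site.xml / hdfs-site.xml
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop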

You can create your myhdfsdirectory as below:

hdfs dfs -mkdir /myhdfsdirectory

and from your local file system you can move your myfiletoprocess.txt into the HDFS directory using the command below:

hdfs dfs -copyFromLocal mylocalfile /myhdfsdirectory/myfiletoprocess.txt


I am also using CDH 5. For me the full path, i.e. "hdfs://nn1home:8020", does not work for some strange reason. Most examples show the path like that.

I used the command like this:

val textFile = sc.textFile("hdfs:/input1/Card_History2016_3rdFloor.csv")

Output of the above command:

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:22

scala> textFile.count
res1: Long = 58973

and this works fine for me.


This will work:

val textFile = sc.textFile("hdfs://localhost:9000/user/input.txt")

Here you can take localhost:9000 from the value of the fs.defaultFS parameter in the Hadoop configuration file core-site.xml.


If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/.."

If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt")

That is only one /.
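
With the single slash, the hdfs:/ path carries no host or port, so it is resolved against the fs.defaultFS value from your Hadoop configuration and ends up pointing at the same NameNode as the fully qualified form.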


import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
conf.set("fs.defaultFS", "hdfs://hostname:9000")
val sc = new SparkContext(conf)
// fully qualified URI, so the lookup does not depend on the local Hadoop config
val data = sc.textFile("hdfs://hostname:9000/hdfspath/")
data.saveAsTextFile("C://dummy/")

The code above reads all the HDFS files in the directory and saves them locally in the C://dummy folder.
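
Note that saveAsTextFile writes a directory of part-NNNNN files at the target location rather than a single file, and it fails if the target directory already exists.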