Issue – How to read/write different file formats in HDFS using PySpark
| File Format | Action | Procedure | Example (without compression) |
| --- | --- | --- | --- |
| Text file | Read | sc.textFile() | orders = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders") |
| Text file | Write | rdd.saveAsTextFile() | orders.saveAsTextFile("/user/BDD/navnit/saveTextFile/orders") |
| Sequence file | Read | sc.sequenceFile() | ordersSF = sc.sequenceFile("/user/BDD/navnit/saveSequenceFile/orders") |
| Sequence file | Write | rdd.saveAsSequenceFile() | ordersKV.saveAsSequenceFile("/user/BDD/navnit/saveSequenceFile/orders") |
| Avro file | Read | sqlContext.read.format("com.databricks.spark.avro").load() | orders = sqlContext.read.format("com.databricks.spark.avro").load("/home/BDD/navnit/orders/") |
| Avro file | Write | dataFrame.write.format("com.databricks.spark.avro").save() | orders.write.format("com.databricks.spark.avro").save("/user/BDD/navnit/saveAvroFile/orders") |
| Parquet file | Read | sqlContext.read.parquet() | ordersParquet = sqlContext.read.parquet("/user/BDD/navnit/saveparquetFile/orders") |
| Parquet file | Write | dataFrame.write.parquet() | orders.write.parquet("/user/BDD/navnit/saveparquetFile/orders") |
| ORC file | Read | sqlContext.read.orc() | ordersOrc = sqlContext.read.orc("/user/BDD/navnit/saveorcFile/orders") |
| ORC file | Write | dataFrame.write.orc() | orders.write.orc("/user/BDD/navnit/saveorcFile/orders") |
| JSON file | Read | sqlContext.read.json() | ordersJSON = sqlContext.read.json("/user/BDD/navnit/saveJSONFile/orders") |
| JSON file | Write | dataFrame.write.json() | orders.write.json("/user/BDD/navnit/saveJSONFile/orders") |

Runnable sketches for each format follow the table.
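A minimal sketch for the text-file rows, using the paths from the table. It assumes a standalone script; in the pyspark shell, `sc` already exists and the SparkContext setup can be dropped.

```python
from pyspark import SparkContext

sc = SparkContext(appName="TextFileIO")

# Read: one RDD element per line of the input files
orders = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders")

# Write: the target directory must not already exist
orders.saveAsTextFile("/user/BDD/navnit/saveTextFile/orders")

# To compress instead, pass a Hadoop codec class (output path here is hypothetical):
# orders.saveAsTextFile("/user/BDD/navnit/saveTextFile/ordersGz",
#                       compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```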
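For the sequence-file rows, the table's `ordersKV` must be a pair RDD, since sequence files store key-value records. A sketch under that assumption; keying on the first comma-separated field is illustrative, not something the source specifies.

```python
from pyspark import SparkContext

sc = SparkContext(appName="SequenceFileIO")

# Sequence files hold key-value pairs, so build a pair RDD first.
# Keying on the first comma-separated field is an assumption about
# the orders data; choose whatever key fits your records.
orders = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders")
ordersKV = orders.map(lambda line: (line.split(",")[0], line))

# Write the pair RDD as a Hadoop sequence file
ordersKV.saveAsSequenceFile("/user/BDD/navnit/saveSequenceFile/orders")

# Read it back as an RDD of (key, value) tuples
ordersSF = sc.sequenceFile("/user/BDD/navnit/saveSequenceFile/orders")
```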
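Avro is not built into Spark 1.x, so the Avro rows need the external spark-avro package on the classpath. The package coordinates below are one plausible version for a Scala 2.10 build, not something the source pins down.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

# spark-avro is an external package; launch with coordinates matching
# your Spark/Scala build, for example (assumed version):
#   pyspark --packages com.databricks:spark-avro_2.10:2.0.1
sc = SparkContext(appName="AvroIO")
sqlContext = SQLContext(sc)

# Read Avro files into a DataFrame
orders = sqlContext.read.format("com.databricks.spark.avro").load("/home/BDD/navnit/orders/")

# Write the DataFrame back out as Avro
orders.write.format("com.databricks.spark.avro").save("/user/BDD/navnit/saveAvroFile/orders")
```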
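A sketch for the Parquet rows. Since the write side needs a DataFrame to start from, it builds one from the retail_db orders text data; the four-column layout (id, date, customer id, status) is an assumption about that dataset.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="ParquetIO")
sqlContext = SQLContext(sc)

# Build a DataFrame from the orders text data; the four-column layout
# is an assumption about the retail_db dataset.
rows = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders") \
    .map(lambda line: line.split(",")) \
    .map(lambda f: Row(order_id=int(f[0]), order_date=f[1],
                       order_customer_id=int(f[2]), order_status=f[3]))
orders = sqlContext.createDataFrame(rows)

# Write as Parquet
orders.write.parquet("/user/BDD/navnit/saveparquetFile/orders")

# Read it back into a DataFrame
ordersParquet = sqlContext.read.parquet("/user/BDD/navnit/saveparquetFile/orders")
```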
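For the ORC rows, note that ORC support in Spark 1.x comes from the Hive integration, so a HiveContext (a Spark build with Hive) is needed rather than a plain SQLContext. Reusing the Parquet output above as the source DataFrame is just a convenience.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="OrcIO")
# ORC in Spark 1.x is provided via Hive support
sqlContext = HiveContext(sc)

# Any DataFrame works as the source; reusing the Parquet output here
orders = sqlContext.read.parquet("/user/BDD/navnit/saveparquetFile/orders")

# Write as ORC
orders.write.orc("/user/BDD/navnit/saveorcFile/orders")

# Read it back
ordersOrc = sqlContext.read.orc("/user/BDD/navnit/saveorcFile/orders")
```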
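Finally, the JSON rows. The write emits one JSON object per line, and the read infers the schema by scanning the records. As above, the source DataFrame is reused from the Parquet output for convenience.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="JsonIO")
sqlContext = SQLContext(sc)

# Any DataFrame works as the source; reusing the Parquet output here
orders = sqlContext.read.parquet("/user/BDD/navnit/saveparquetFile/orders")

# Write: one JSON object per line in the output files
orders.write.json("/user/BDD/navnit/saveJSONFile/orders")

# Read it back; the schema is inferred from the JSON records
ordersJSON = sqlContext.read.json("/user/BDD/navnit/saveJSONFile/orders")
```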