Create single file in AWS Glue (pySpark) and store as custom file name S3

AWS Glue – AWS Glue is a serverless ETL tool developed by AWS, built on top of Spark. Because Spark is a distributed processing engine, by default it creates multiple output files with auto-generated names (e.g. part-00000). Generating a single file: you might have a requirement to create a single output file with a custom name. In order for you to create… Read More Create single file in AWS Glue (pySpark) and store as custom file name S3
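The common approach the excerpt alludes to can be sketched as follows: coalesce the DataFrame to one partition, write it to a temporary S3 prefix, then copy the auto-named part file to the desired key with boto3. The bucket and paths here are illustrative placeholders, not values from the post.

```python
# Sketch: write a Spark DataFrame as a single file, then copy it to a
# custom S3 key. Bucket, prefixes, and the final key are assumptions.

def find_part_file(keys):
    """Return the first Spark part-* file key from an S3 key listing."""
    return next(k for k in keys
                if k.rsplit("/", 1)[-1].startswith("part-"))

def write_single_file(df, bucket, tmp_prefix, final_key):
    """Coalesce to one partition, write, then 'rename' via copy + delete.

    Assumes the Glue job's IAM role grants the needed S3 permissions.
    """
    import boto3  # available by default in AWS Glue jobs

    # 1. Force a single output partition so only one part-* file is produced.
    df.coalesce(1).write.mode("overwrite").csv(
        "s3://{}/{}".format(bucket, tmp_prefix), header=True)

    # 2. Locate the auto-named part file and copy it to the custom key.
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)
    part_key = find_part_file(obj["Key"] for obj in listing["Contents"])
    s3.copy_object(Bucket=bucket, Key=final_key,
                   CopySource={"Bucket": bucket, "Key": part_key})

    # 3. Clean up the temporary prefix (part file, _SUCCESS marker, etc.).
    for obj in listing["Contents"]:
        s3.delete_object(Bucket=bucket, Key=obj["Key"])
```

S3 has no true rename, so copy-then-delete is the standard way to get a custom file name; `coalesce(1)` is only sensible for output small enough to fit through a single executor.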

Reading\Writing Different file format in HDFS by using pyspark

Issue – How to read\write different file formats in HDFS by using pyspark

File Format   | Action | Procedure                                                  | Example (without compression)
Text file     | Read   | sc.textFile()                                              | orders = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders")
Text file     | Write  | rdd.saveAsTextFile()                                       | orders.saveAsTextFile("/user/BDD/navnit/saveTextFile/orders")
Sequence file | Read   | sc.sequenceFile()                                          | ordersSF = sc.sequenceFile("/user/BDD/navnit/saveSequenceFile/orders")
Sequence file | Write  | PipelinedRDD.saveAsSequenceFile()                          | ordersKV.saveAsSequenceFile("/user/BDD/navnit/saveSequenceFile/orders")
Avro file     | Read   | sqlContext.read.format("com.databricks.spark.avro").load() | orders = sqlContext.read.format("com.databricks.spark.avro").load("/home/BDD/navnit/orders/")
Avro file     | Write  | dataFrame.write.format("com.databricks.spark.avro").save() | orders.write.format("com.databricks.spark.avro").save("/user/BDD/navnit/saveAvroFile/orders")
Parquet file  | Read   | sqlContext.read.parquet()                                  | ordersParquet =… Read More Reading\Writing Different file format in HDFS by using pyspark
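The calls in the table above can be sketched together as one pyspark routine. This uses the Spark 1.x-era `sc`/`sqlContext` API that the post's examples follow, reuses the post's HDFS paths, and assumes the `com.databricks:spark-avro` package is on the classpath; the `to_kv` helper and the parquet paths are my own illustrative additions (the excerpt truncates before its parquet example).

```python
# Sketch of the read/write calls from the table above. Nothing executes
# until the function is called with a live SparkContext / SQLContext.

def to_kv(line):
    """Split a CSV record into a (key, value) pair for sequence files.
    Hypothetical helper: sequence files store key/value records."""
    return (line.split(",")[0], line)

def hdfs_io_examples(sc, sqlContext):
    # Text file
    orders = sc.textFile("/user/BDD/navnit/data-master/retail_db/orders")
    orders.saveAsTextFile("/user/BDD/navnit/saveTextFile/orders")

    # Sequence file: the RDD must hold (key, value) pairs before saving
    ordersKV = orders.map(to_kv)
    ordersKV.saveAsSequenceFile("/user/BDD/navnit/saveSequenceFile/orders")
    ordersSF = sc.sequenceFile("/user/BDD/navnit/saveSequenceFile/orders")

    # Avro: requires the com.databricks:spark-avro package
    ordersAvro = sqlContext.read.format("com.databricks.spark.avro") \
        .load("/home/BDD/navnit/orders/")
    ordersAvro.write.format("com.databricks.spark.avro") \
        .save("/user/BDD/navnit/saveAvroFile/orders")

    # Parquet (both paths hypothetical; the excerpt truncates here)
    ordersParquet = sqlContext.read.parquet("/user/BDD/navnit/orders_parquet")
    ordersParquet.write.parquet("/user/BDD/navnit/saveParquetFile/orders")
```

On Spark 2.x+ the same reads and writes hang off a single `SparkSession` (e.g. `spark.read.parquet(...)`), and Avro support is built in via `format("avro")`.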