Create a single file in AWS Glue (PySpark) and store it under a custom file name in S3

AWS Glue – AWS Glue is a serverless ETL tool developed by AWS, built on top of Apache Spark. Because Spark is a distributed processing engine, by default it writes one file per partition, so a job typically produces multiple output files with names that start with part- (e.g. part-00000, part-00001, and so on).

Generating a single file

You might have a requirement to produce a single output file. To do so, you can use repartition() or coalesce().

Spark's repartition() method can be used to increase or decrease the number of partitions; it performs a full shuffle of the data.

Spark's coalesce() method is used only to reduce the number of partitions; it avoids a full shuffle, which makes it the cheaper choice here.
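To see the difference, here is a minimal sketch (the DataFrame new_df and the partition counts are illustrative placeholders):

## repartition() can increase or decrease the partition count (full shuffle)
df_ten = new_df.repartition(10)
print(df_ten.rdd.getNumPartitions())   # 10

## coalesce() can only reduce the partition count (no full shuffle)
df_one = new_df.coalesce(1)
print(df_one.rdd.getNumPartitions())   # 1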

Here is an example using a DynamicFrame.

DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.
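For context, a DynamicFrame is typically read from the Glue Data Catalog; here is a minimal sketch, assuming a catalog database and table (my_database and my_table are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

## read a catalog table into a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(
    database = "my_database",   # placeholder
    table_name = "my_table"     # placeholder
)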

DynamicFrame.coalesce(1)

e.g.

## adding coalesce to dynamic frame
Transform4 = Transform4.coalesce(1)

## writing the single file to the s3 location
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame = Transform4,
    connection_type = "s3",
    format = "csv",
    connection_options = {"path": "s3://outputpath/", "partitionKeys": []},
    transformation_ctx = "DataSink0"
)

DataFrame

One of the major abstractions in Apache Spark is the SparkSQL DataFrame, which is similar to the DataFrame construct found in R and Pandas. A DataFrame is similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).
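If your Glue job already holds a DynamicFrame (like Transform4 above), you can convert between the two abstractions; a minimal sketch, assuming Transform4 and glueContext exist as in the earlier example:

from awsglue.dynamicframe import DynamicFrame

## DynamicFrame -> DataFrame
new_df = Transform4.toDF()

## DataFrame -> DynamicFrame (needs the GlueContext and a name)
dyf_back = DynamicFrame.fromDF(new_df, glueContext, "dyf_back")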

DataFrame.coalesce(1)

e.g.

new_df.coalesce(1).write.format("csv").mode("overwrite").option("codec", "gzip").save(outputpath)

Using coalesce(1) will create a single file; however, the file name will still be in the Spark-generated format, e.g. starting with part-0000.

As S3 does not offer any function to rename a file, in order to get a custom file name in S3 the first step is to copy the file to the custom name and then delete the Spark-generated file.

Here is the code snippet which helps you generate the custom file name and delete the Spark-generated file:

import boto3

client = boto3.client('s3')

source_bucket = '<source_s3 bucket>' # e.g. - 'source-s3-bucket'
srcPrefix = '<folder>' # e.g. - 'gzipcsvsingle/'
target_bucket = '<target_s3 bucket>' # e.g. - 'target-s3-bucket'
targetPrefix = '<s3-folder>' # e.g. - 'gzipcsvsingle/output/'

try:
    ## Get a list of files with the prefix (we know there will be only one file)
    response = client.list_objects(
        Bucket = source_bucket,
        Prefix = srcPrefix,
        Delimiter = '/'
    )
    name = response["Contents"][0]["Key"]
    print(name)

    ## The copy source is the Spark-generated file in the source bucket
    copy_source = {'Bucket': source_bucket, 'Key': name}
    print(copy_source)

    ## Build the target key; this is the new name of the file
    target_key = targetPrefix + '<file-name>' # e.g. - 'output.csv'
    print(target_key)

    ### Now copy the file with the new name into the target bucket
    client.copy(CopySource = copy_source, Bucket = target_bucket, Key = target_key)

    ### Delete the old Spark-generated file
    client.delete_object(Bucket = source_bucket, Key = name)

except Exception as e:
    print('error occurred:', e)
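One caveat: if other objects share the same prefix (for example, parallel job runs writing files like run-xxx), Contents[0] may not be the file you want. A variant that reuses the response from the snippet above and keeps only the Spark-generated output file (the part- check is an assumption based on Spark's and Glue's default file naming):

## keep only the Spark/Glue-generated output file(s) under the prefix
part_keys = [obj["Key"] for obj in response["Contents"]
             if "part-" in obj["Key"].split("/")[-1]]
if len(part_keys) != 1:
    raise ValueError("expected exactly one output file, found %d" % len(part_keys))
name = part_keys[0]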

Output – the target prefix now contains a single file with the custom name.

