Create a single file in AWS Glue (PySpark) and store it under a custom file name in S3

AWS Glue – AWS Glue is a serverless ETL tool developed by AWS. It is built on top of Spark. As Spark is a distributed processing engine, by default it creates multiple output files, one per partition, with Spark-generated names, e.g. names that start with part-0000.

Generating a Single file

You might have a requirement to create a single output file. To create a single file, you can use repartition() or coalesce().

Spark RDD repartition() is used to increase or decrease the number of partitions. It performs a full shuffle of the data.

Spark RDD coalesce() is used only to reduce the number of partitions. It merges existing partitions and avoids a full shuffle.

Here is an example for DynamicFrame

DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.

DynamicFrame.coalesce(1)

e.g.

## adding coalesce to dynamic frame

Transform4 = Transform4.coalesce(1)

## adding file to s3 location

DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform4, connection_type = "s3", format = "csv", connection_options = {"path": "s3://outputpath/", "partitionKeys": []}, transformation_ctx = "DataSink0")

DataFrame

One of the major abstractions in Apache Spark is the SparkSQL DataFrame, which is similar to the DataFrame construct found in R and Pandas. A DataFrame is similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

DataFrame.coalesce(1)

e.g.

new_df.coalesce(1).write.format("csv").mode("overwrite").option("codec", "gzip").save(outputpath)

Using coalesce(1) creates a single file; however, the file name still remains in the Spark-generated format, e.g. it starts with part-0000.
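To make the naming concrete, here is a minimal sketch of how the Spark-generated data file could be picked out of a listing of object keys. The helper name find_part_file and the sample keys are hypothetical, not part of Glue or Spark. The sketch assumes the data file's base name contains "part-" and that marker objects such as _SUCCESS should be skipped.

```python
def find_part_file(keys):
    """Return the single Spark-generated data file among the given S3 keys.

    Assumption: the data file's base name contains "part-", while marker
    objects (for example "_SUCCESS") do not.
    """
    matches = [k for k in keys if "part-" in k.rsplit("/", 1)[-1]]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one part file, found {len(matches)}")
    return matches[0]


# Hypothetical listing of an output folder after a coalesce(1) write
sample_keys = [
    "output/_SUCCESS",
    "output/part-00000-abc123.csv.gz",
]
part_key = find_part_file(sample_keys)
print(part_key)
```

Raising an error when zero or several files match is a design choice: it stops a later rename from acting on the wrong object.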

S3 does not offer any function to rename a file. To create a custom file name in S3, the first step is to copy the file under the custom name, and the second is to delete the Spark-generated file.

Here is a code snippet that creates the file with the custom name and deletes the Spark-generated file

import boto3

source_bucket = '<s3 bucket name>'
srcPrefix = '<s3 output prefix/folder>'

try:
    client = boto3.client('s3')

    ## Get a list of files with prefix
    response = client.list_objects(
        Bucket = source_bucket,
        Prefix = srcPrefix
    )

    ## Pick the spark generated file; the listing can also contain
    ## markers such as _SUCCESS, so do not rely on the first entry
    part_keys = [obj["Key"] for obj in response["Contents"]
                 if "part-" in obj["Key"].split("/")[-1]]
    name = part_keys[0]
    print(name)

    ## Source of the copy: the spark generated file
    target_source = {'Bucket': source_bucket, 'Key': name}
    print(target_source)

    ## Target key, this is the new name of the file
    target_key = srcPrefix + '<custom file name>'
    print(target_key)

    ### Now Copy the file with New Name
    client.copy(CopySource=target_source, Bucket=source_bucket, Key=target_key)

    ### Delete the old file
    client.delete_object(Bucket=source_bucket, Key=name)

except Exception as e:
    ## Report the failure instead of hiding it
    print('error occurred:', e)
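
The copy-then-delete steps can also be wrapped in a small function, so that the delete only runs when the copy did not raise. This is a sketch, not part of the original snippet: rename_s3_object and FakeS3Client are hypothetical names. The client is passed in as an argument so the function can be tried with a stand-in object; in a Glue job you would pass boto3.client('s3') instead. It relies only on the copy and delete_object client calls.

```python
def rename_s3_object(client, bucket, old_key, new_key):
    """Emulate a rename in S3: copy old_key to new_key, then delete old_key.

    `client` is expected to behave like boto3.client('s3').
    """
    if old_key == new_key:
        # A rename to the same key is refused rather than risk deleting the object
        raise ValueError("old_key and new_key must differ")
    client.copy(
        CopySource={"Bucket": bucket, "Key": old_key},
        Bucket=bucket,
        Key=new_key,
    )
    # Reached only if the copy did not raise
    client.delete_object(Bucket=bucket, Key=old_key)
    return new_key


class FakeS3Client:
    """In-memory stand-in for the two client calls used above (illustration only)."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def copy(self, CopySource, Bucket, Key):
        self.objects[Key] = self.objects[CopySource["Key"]]

    def delete_object(self, Bucket, Key):
        del self.objects[Key]


# Hypothetical bucket and keys, used only to exercise the function
fake = FakeS3Client({"output/part-00000": b"a,b\n1,2\n"})
rename_s3_object(fake, "my-bucket", "output/part-00000", "output/report.csv")
print(sorted(fake.objects))
```

Because the function raises on failure instead of printing, the calling job can decide whether to retry or fail the run.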
