AWS Glue – AWS Glue is a serverless ETL service developed by AWS. It is built on top of Apache Spark. Because Spark is a distributed processing engine, by default it creates multiple output files, e.g. files whose names start with part-0000 in the output location.

Generating a Single File
You might have a requirement to create a single output file. To create a single file, you can use repartition() or coalesce().
Spark's repartition() method is used to increase or decrease the number of partitions.
Spark's coalesce() method is used only to reduce the number of partitions.
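For reference, here is a minimal PySpark sketch (outside of Glue) showing the difference; the sample DataFrame and the partition counts are only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(0, 1000)                # sample DataFrame
print(df.rdd.getNumPartitions())         # default number of partitions

df_more = df.repartition(8)              # repartition() can increase or decrease (triggers a full shuffle)
print(df_more.rdd.getNumPartitions())    # 8

df_one = df.coalesce(1)                  # coalesce() can only reduce, without a full shuffle
print(df_one.rdd.getNumPartitions())     # 1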
Here is an example for a DynamicFrame. A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.
DynamicFrame.coalesce(1)
e.g.
## add coalesce to the dynamic frame so only one partition (file) is written
Transform4 = Transform4.coalesce(1)
## write the single output file to the S3 location
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform4, connection_type = "s3", format = "csv", connection_options = {"path": "s3://outputpath/", "partitionKeys": []}, transformation_ctx = "DataSink0")
DataFrame
One of the major abstractions in Apache Spark is the SparkSQL DataFrame, which is similar to the DataFrame construct found in R and Pandas. A DataFrame is similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).
DataFrame.coalesce(1)
e.g.
new_df.coalesce(1).write.format("csv").mode("overwrite").option("codec", "gzip").save(outputpath)
Using coalesce(1) will create a single file; however, the file name will still be in the Spark-generated format, e.g. starting with part-0000.

As S3 does not offer a function to rename a file, the way to create a custom file name in S3 is to first copy the file to the custom name and then delete the Spark-generated file.
Here is a code snippet which helps you generate the custom file name and delete the Spark-generated file:
import boto3

client = boto3.client('s3')
source_bucket = '<source_s3_bucket>'   # e.g. 'source-s3-bucket'
srcPrefix = '<folder>'                 # e.g. 'gzipcsvsingle/'
target_bucket = '<target_s3_bucket>'   # e.g. 'target-s3-bucket'
targetPrefix = '<s3-folder>'           # e.g. 'gzipcsvsingle/output/'

try:
    ## Get a list of files with the prefix (we know there will be only one file)
    response = client.list_objects(
        Bucket=source_bucket,
        Prefix=srcPrefix,
        Delimiter='/'
    )
    name = response["Contents"][0]["Key"]
    print(name)

    ## The Spark-generated file is the copy source
    copy_source = {'Bucket': source_bucket, 'Key': name}
    print(copy_source)

    ## Build the target key, i.e. the new name of the file
    target_key = targetPrefix + '<file-name>'   # e.g. 'output.csv'
    print(target_key)

    ### Now copy the file with the new name
    client.copy(CopySource=copy_source, Bucket=target_bucket, Key=target_key)

    ### Delete the old (Spark-generated) file
    client.delete_object(Bucket=source_bucket, Key=name)
except Exception as e:
    print('Error occurred: ' + str(e))
Output –

Multiple jobs are running in parallel, which creates files like run-xxx. In this case there will be more than one file at a time. How can we resolve this?
The easy way is to have your jobs write their output to different folders; otherwise it will get a little tricky.
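One way to follow the "different folders" suggestion is to give each job run its own output prefix, so the single file from coalesce(1) never collides with output from a parallel run. Here is a minimal sketch assuming the Glue job from the earlier snippet (glueContext, Transform4) and a hypothetical job argument named JOB_RUN_PREFIX passed when the job is started:
import sys
from awsglue.utils import getResolvedOptions

## JOB_RUN_PREFIX is a hypothetical argument, e.g. --JOB_RUN_PREFIX job-a
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'JOB_RUN_PREFIX'])

## each run writes under its own folder, e.g. s3://outputpath/job-a/
output_path = "s3://outputpath/" + args['JOB_RUN_PREFIX'] + "/"

DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame = Transform4,
    connection_type = "s3",
    format = "csv",
    connection_options = {"path": output_path, "partitionKeys": []},
    transformation_ctx = "DataSink0")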