AWS Athena and S3 Partitioning

Athena is a great tool to query your data stored in S3 buckets.

To have the best performance and properly organize the files I wanted to use partitioning.

Although very common practice, I haven't found a nice and simple tutorial that would explain in detail how to properly store and configure the files in S3 so that I could take full advantage of the Athena partitioning features.

After some testing, I managed to figure out how to set it up.

This is what I did:

Starting from a CSV file with a datetime column, I wanted to create an Athena table, partitioned by date.

A basic google search led me to this page, but It was lacking some more detailing.

The biggest catch was to understand how the partitioning works.

Based on a datetime column(processeddate), I had to split the date into the year, month and day components to create new derived columns, which in turn I'll use as the partition keys to my table

Example of date component split to create the partition keys

I created the table in Athena with this command:

CREATE EXTERNAL TABLE IF NOT EXISTS dbname.tableexample(

`CUSTOMERID` string,

`QUOTEID` string,

`PROCESSEDDATE` timestamp

) PARTITIONED BY (

`YEAR` string ,

`MONTH` string ,

`DAY` string

)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

WITH SERDEPROPERTIES (

'serialization.format' = ',',

'field.delim' = ','

) LOCATION 's3://yourbucket/'

TBLPROPERTIES ('has_encrypted_data'='false');

You'll see this output in your results window

Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more.

The derived columns are not present in the csv file which only contain `CUSTOMERID`, `QUOTEID` and `PROCESSEDDATE`, so Athena gets the partition keys from the S3 path.

When uploading your files to S3, this format needs to be used:

S3://yourbucket/year=2017/month=10/day=24/file.csv

Note that it explicitly uses the partition key names as the subfolders names in your S3 path..

Using the key names as the folder names is what enables the use of the auto partitioning feature of Athena.

After uploading new files, run MSCK REPAIR TABLE tablename and to add the new files to your table without you having to worry about manually creating partitions.

After running

MSCK REPAIR TABLE tableexample;

You'll see this on the results box

Partitions not in metastore: tableexample:year=2017/month=10/day=24

Repair: Added partition to metastore tableexample:year=2017/month=10/day=24

Now you can query your table and see your data stored on S3, organized by year, month and day folders.

Comments

RKARATOOctober 16, 2018 at 12:52 AM
Thank you for you comment! Nice to know I could contribute with my 2 cents!
ReplyDelete
Replies
Siva BudatiNovember 17, 2018 at 4:15 AM
Exactly the information I needed. Many thanks.
ReplyDelete
Replies
GopiMarch 26, 2021 at 11:55 PM
This comment has been removed by a blog administrator.
ReplyDelete
Replies
UnknownAugust 27, 2021 at 3:43 AM
This comment has been removed by a blog administrator.
ReplyDelete
Replies

Work Recipes et al

Search This Blog

AWS Athena and S3 Partitioning

Comments

Post a Comment

Popular posts from this blog

NodeJS MSSQL connection error "EINVALIDSTATE"

Generate SQL Loader Control Files