Utilizing Write Sharding to Optimize Data Ingestion

45 minutes
  • 5 Learning Objectives

About this Hands-on Lab

In this lab, we investigate a DynamoDB table loading script that is losing data, and fix it by modifying each item's partition key so that writes are sharded across the table's partitions.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Investigate Provided Instance and Data

Log in to the provided EC2 instance with the credentials provided in the lab information. Look at the data being written to the table with the following command:

cat dataload/bin_meta_10 | jq .

(OPTIONAL) Run Unmodified `loadtable.py`

Run loadtable.py with the following command, and observe the results in the DynamoDB web console:

python3 loadtable.py > load.log &

(If you do this step, re-create the AmazonBins table before moving on. Give it a partition key named Partition, of type string, and a sort key named bin, also of type string.)
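For reference, the key layout described above can be written down as the arguments that a DynamoDB `create_table` call expects. This is only a sketch: the helper name `amazon_bins_key_definition` is hypothetical, and the lab text does not state the table's capacity settings, so those are deliberately left out and must be supplied by whoever makes the call.

```python
def amazon_bins_key_definition() -> dict:
    """Return the key-related arguments for re-creating the AmazonBins table.

    Capacity settings (billing mode or provisioned throughput) are not given
    in the lab text, so they are intentionally omitted here.
    """
    return {
        "TableName": "AmazonBins",
        # Only key attributes are declared up front; both are strings ("S").
        "AttributeDefinitions": [
            {"AttributeName": "Partition", "AttributeType": "S"},
            {"AttributeName": "bin", "AttributeType": "S"},
        ],
        # HASH marks the partition key, RANGE marks the sort key.
        "KeySchema": [
            {"AttributeName": "Partition", "KeyType": "HASH"},
            {"AttributeName": "bin", "KeyType": "RANGE"},
        ],
    }


definition = amazon_bins_key_definition()
print(definition["KeySchema"])
```

With boto3 installed and credentials configured, the dictionary could be passed along as `boto3.client("dynamodb").create_table(**definition, ...)`, with the capacity arguments filled in to match the original table.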

Modify `loadtable.py`

Edit the transform function in loadtable.py to modify the Partition key for each item to create multiple partitions in the table. This can be accomplished by:

  • Generating random characters
  • Assigning alphanumeric partitions to records
  • Using some value from the existing data to increase the cardinality of the values stored in the Partition key
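
The approaches listed above can be sketched as follows. This is a minimal illustration, not the lab's solution code: it assumes each item is a plain dictionary with string `Partition` and `bin` values, and the `transform` signature, the `#` separator, and `SHARD_COUNT` are all assumptions. The real `loadtable.py` may shape its items differently.

```python
import random
import zlib

SHARD_COUNT = 10  # assumed value; choose it based on the write throughput needed


def random_shard(partition: str, shard_count: int = SHARD_COUNT) -> str:
    """Append a random suffix. Writes spread evenly, but a reader cannot
    predict which shard holds a given item and must query every shard."""
    return f"{partition}#{random.randrange(shard_count)}"


def calculated_shard(partition: str, bin_id: str, shard_count: int = SHARD_COUNT) -> str:
    """Derive the suffix from an existing value (here, the sort key), so the
    same bin always maps to the same shard and can be read back directly."""
    suffix = zlib.crc32(bin_id.encode("utf-8")) % shard_count
    return f"{partition}#{suffix}"


def transform(item: dict, shard_count: int = SHARD_COUNT) -> dict:
    """Return a copy of the item whose Partition value carries a shard suffix."""
    sharded = dict(item)  # leave the caller's item untouched
    sharded["Partition"] = calculated_shard(item["Partition"], item["bin"], shard_count)
    return sharded


if __name__ == "__main__":
    example = {"Partition": "bins", "bin": "00042"}  # hypothetical item
    print(transform(example)["Partition"])
```

The calculated suffix is usually the more convenient choice when items must be read back by key, since the shard can be recomputed from the bin value. The random suffix needs no knowledge of the data, at the cost of fan-out reads.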

Run Modified `loadtable.py`

Run your modified version of loadtable.py:

python3 loadtable.py > load2.log &

Observe Results

In the DynamoDB console, on the table's Metrics tab, observe the write capacity unit (WCU) usage, throttled write requests, and throttled write events. WCU usage should be around 2000, and both throttle metrics should be zero.
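
To see why a single partition key value cannot reach this figure, recall that DynamoDB limits each partition to roughly 1,000 write capacity units per second. The arithmetic below is a rough sketch: the helper name `minimum_shards` is hypothetical, and the result is only a lower bound, because distinct key values can still hash to the same physical partition.

```python
import math

PARTITION_WCU_LIMIT = 1000  # per-partition write ceiling in DynamoDB


def minimum_shards(target_wcu: int, per_partition_wcu: int = PARTITION_WCU_LIMIT) -> int:
    """Lower bound on distinct partition key values needed to sustain target_wcu."""
    if target_wcu <= 0:
        raise ValueError("target_wcu must be positive")
    return math.ceil(target_wcu / per_partition_wcu)


print(minimum_shards(2000))  # the WCU figure quoted in this lab
```

In practice it is safer to use noticeably more shard values than this minimum, so that an uneven hash distribution does not leave one partition overloaded.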

Additional Resources

Our coworker has written a script that loads image metadata into DynamoDB. The images are gathered by robots in Amazon warehouses and show the bins containing the items for orders. The script has run into trouble: writes to the AmazonBins table are being throttled heavily, and data is not making it into the table. Our coworker modified the script to attempt each write operation only once, which sped it up, but it is still losing data.

Our coworker has stood up an environment for us with the AmazonBins table already created and an EC2 instance pre-loaded with the data files and the loading script.

We need to modify the transform function in this script to implement partition sharding, which should eliminate throttling and ensure that all of the data is loaded into the table. Once the modification is made, we will run the script to verify that no data is lost, write capacity unit utilization has improved, and throttling has been eliminated.

Code used in the solution video for this lab can be found here.
