The explanation changed the partition key from Date to Sensor. This removed the hot partition, as one day’s readings were then no longer stored in a single partition. The request was then ‘give me all the sensor readings for 2018-01-01’, which was spread over all the partitions.
My question is about the correctness of this.
In DynamoDB you can do either Scan or Query operations. Scans are expensive because they read every item in the table. Queries are better as they only hit the partition you are interested in, and with a sort key you can narrow down to the results you want.
With the ‘give me all the sensor readings for 2018-01-01’ request, it seems you don’t know all the sensor ids, so it has to be a scan, which will consume read capacity in proportion to the size of the table?
You might have the list of all the sensors, but then that’s N queries: ‘give me all the sensor readings for 2018-01-01 for sensor id #1’, then repeat for #2, #3, and so on. (This moves the complexity into the code.)
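To make the N-queries idea concrete, here is a rough sketch of the fan-out. It only builds the request parameters that could be passed to boto3's `client.query(**params)`; it does not call AWS. The table name `SensorReadings`, the attribute names `sensor_id` and `reading_timestamp`, and the assumption that the sort key is an ISO timestamp string are all mine, not from the course.

```python
def build_daily_queries(sensor_ids, date, table_name="SensorReadings"):
    """Return one Query request (as a dict) per sensor for the given date."""
    return [
        {
            "TableName": table_name,  # hypothetical table name
            # Partition key must match exactly; sort key is narrowed by date prefix.
            "KeyConditionExpression": "#pk = :sensor AND begins_with(#sk, :date)",
            "ExpressionAttributeNames": {
                "#pk": "sensor_id",          # assumed partition key attribute
                "#sk": "reading_timestamp",  # assumed sort key attribute
            },
            "ExpressionAttributeValues": {
                ":sensor": {"S": sensor_id},
                ":date": {"S": date},
            },
        }
        for sensor_id in sensor_ids
    ]


queries = build_daily_queries(["Sen1", "Sen2", "Sen3"], "2018-01-01")
print(len(queries))  # one request per known sensor id
```

Each request touches only one partition, but the caller has to know every sensor id up front and merge the results itself.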
Please correct me if I’m wrong.
If I am correct, there is then a new question: how should such queries be done?
Due to the way DynamoDB works, such aggregation queries are a bit painful.
One option is to pre-compute an aggregation and store it in DynamoDB, so that answering the request is a single item read. This can work by using DynamoDB streams to trigger a Lambda that does the aggregation. Rather than aggregating over all devices, we might also group devices by user or region. With this we might then have the following pk and sk:
pk=USER sk=READING#DATE#SENSOR value=…
pk=USER sk=SUMMARY#DATE values=…
If my sensors 1 and 2 each make a reading, that would result in the following DynamoDB actions:
pk=Andy sk=READING#2018-01-01#Sen1 value=1 (Single write, causes DynamoDB stream trigger causing..)
pk=Andy sk=SUMMARY#2018-01-01 values=[1]
Later, sensor 2 writes its data:
pk=Andy sk=READING#2018-01-01#Sen2 value=3 (again, this write triggers an update of ..)
pk=Andy sk=SUMMARY#2018-01-01 values=[1,3]
There are then 3 items in the table, 2 readings and 1 summary.
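The write-then-summarise flow can be sketched in plain Python. This is only an in-memory stand-in, not real DynamoDB or Lambda code: the `table` dict and the function names are hypothetical, and the "trigger" is just a direct function call.

```python
table = {}  # (pk, sk) -> item; in-memory stand-in for the DynamoDB table


def on_reading_inserted(pk, date):
    """Stand-in for the stream-triggered Lambda: rebuild the day's summary."""
    prefix = f"READING#{date}#"
    values = [
        item["value"]
        for (p, sk), item in sorted(table.items())
        if p == pk and sk.startswith(prefix)
    ]
    table[(pk, f"SUMMARY#{date}")] = {"values": values}


def write_reading(pk, date, sensor, value):
    """Store one reading, then run the aggregation the stream would trigger."""
    table[(pk, f"READING#{date}#{sensor}")] = {"value": value}
    on_reading_inserted(pk, date)  # in DynamoDB this would fire via the stream


write_reading("Andy", "2018-01-01", "Sen1", 1)
write_reading("Andy", "2018-01-01", "Sen2", 3)
print(len(table))
print(table[("Andy", "SUMMARY#2018-01-01")])
```

After both writes the stand-in table holds the two reading items and the one summary item.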
We can then answer the question ‘give me all of Andy’s sensor readings for 2018-01-01’ with a single item read: pk of ‘Andy’, sk of ‘SUMMARY#2018-01-01’.
In the Lambda triggered by DynamoDB streams, the query is possible with a pk of ‘Andy’ and a sort key range between ‘READING#2018-01-01#R’ and ‘READING#2018-01-01#T’, which will pick up all of Andy’s sensors’ readings (the sensor ids start with ‘Sen’, which sorts between ‘R’ and ‘T’).
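As a sketch of that range query, the snippet below builds the parameters that could be handed to boto3's `client.query(**params)` and checks locally that a ‘Sen…’ sort key falls inside the range. The table name `SensorData` is an assumption; `pk` and `sk` are the attribute names from the example above.

```python
def build_readings_range_query(user, date, table_name="SensorData"):
    """Build Query parameters for one user's readings on one date."""
    return {
        "TableName": table_name,  # hypothetical table name
        "KeyConditionExpression": "#pk = :user AND #sk BETWEEN :lo AND :hi",
        "ExpressionAttributeNames": {"#pk": "pk", "#sk": "sk"},
        "ExpressionAttributeValues": {
            ":user": {"S": user},
            ":lo": {"S": f"READING#{date}#R"},
            ":hi": {"S": f"READING#{date}#T"},
        },
    }


def in_range(sort_key, params):
    """Local check mirroring BETWEEN (inclusive) for ASCII string keys."""
    values = params["ExpressionAttributeValues"]
    return values[":lo"]["S"] <= sort_key <= values[":hi"]["S"]


params = build_readings_range_query("Andy", "2018-01-01")
print(in_range("READING#2018-01-01#Sen1", params))
print(in_range("SUMMARY#2018-01-01", params))
```

A `begins_with(#sk, :prefix)` condition with the prefix ‘READING#2018-01-01#’ would be an alternative that does not depend on what letter the sensor ids start with.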
This does make a hot, or mildly warm, partition for ‘Andy’. However, we think there may be other groups of sensors for ‘Bob’ and ‘Alice’ too.
Sorry for the long post. I’m enjoying the course. Keep up the good work.
The aggregation method you describe would work too, but as the economists say, there’s no such thing as a free lunch. The updates triggered from DynamoDB streams and calculated by Lambda also consume reads and writes. Keeping a running aggregate therefore costs, per reading, 1 write (the original), 1 read to fetch the current summary, and 1 write to store the updated summary (if I’m reading your example correctly).
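As a back-of-the-envelope illustration of that overhead, the helper below counts requests only. It is a hypothetical sketch: it does not compute capacity units, which also depend on item size and read consistency.

```python
def request_counts(num_readings, running_aggregate=True):
    """Count table requests for storing readings, with or without a summary."""
    writes = num_readings  # one write per original reading
    reads = 0
    if running_aggregate:
        reads += num_readings   # fetch the current summary for each reading
        writes += num_readings  # write the updated summary for each reading
    return {"reads": reads, "writes": writes}


print(request_counts(2))
print(request_counts(2, running_aggregate=False))
```

For the two-reading example above, that is 4 writes and 2 reads with the running summary, against 2 writes without it.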
If I have defined Date as the partition key on the table, then fetching all records for a single date will be a query and not a scan, but it might create the hot-partition issue. We could instead have the PK as the sensor ID, or indeed some random GUID if we wanted; this has a better chance of distributing the records across partitions. Then we can define a global secondary index on whatever fields we want, maybe date and some arbitrary fixed field ("windfarmsensor"). A query for all records given a date + "windfarmsensor" would then still be a query and not a scan.
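A query against such an index might look roughly like this. Again it only builds the parameters for boto3's `client.query(**params)`; the table name `SensorReadings`, the index name `DateTypeIndex`, and the attribute names `reading_date` and `record_type` are placeholders I made up for illustration.

```python
def build_date_index_query(date, table_name="SensorReadings",
                           index_name="DateTypeIndex"):
    """Build Query parameters that target a global secondary index."""
    return {
        "TableName": table_name,  # hypothetical table name
        "IndexName": index_name,  # hypothetical GSI name
        # Equality on both index key attributes, so this is a Query, not a Scan.
        "KeyConditionExpression": "#date = :date AND #type = :type",
        "ExpressionAttributeNames": {
            "#date": "reading_date",  # assumed index key attribute
            "#type": "record_type",   # assumed fixed-value attribute
        },
        "ExpressionAttributeValues": {
            ":date": {"S": date},
            ":type": {"S": "windfarmsensor"},
        },
    }


gsi_query = build_date_index_query("2018-01-01")
print(gsi_query["IndexName"])
```

Because both conditions are equalities, the same expression works whichever of the two attributes is the index's partition key and whichever is its sort key.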
There are tons of different ways to do things in Dynamo but the AWS docs do give specific examples of time-series data handling that you should know for the exam. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html