5 Answers
Hi Cecil,
Good job on taking the Expert path! It’s hard to venture what might be different between the path I took and the path you took, but the real point was to dig in and provide you a reason to get your hands dirty…which you did so good job!
Some people who’ve gotten different results had left of some aggregations, average or max in their queries…if I recall the data, there were multiple readings per day….but it’s been almost a year since I dug into that case.
Congrats on finding your own path!
–Scott
I think that people are getting different answers, because when using QuickSight we’re visualizing the average of all the readings throughout the day and that is when Tulare has the highest value. While when we use just Athena to order the value entries in descending order, we will get other answers.
I thought I was going crazy because I was getting the same answer as DarkCecil. @Edu is correct, if you average the readings by city/location for the entire day, then you get TULARE:
SELECT city, location, avg(value) AS avgo3
FROM openaq_20181009
WHERE country = ‘US’
AND parameter = ‘o3’
GROUP BY city, location
ORDER BY 3 desc
I also choose the expert path, my way was:
1. Find the data source free for quality air, I got OpenQA
2. Export data from their S3 bucket to a table in Athena
`
CREATE EXTERNAL TABLE openaq
(
date
struct<utc:string,local:string> COMMENT ‘from deserializer’,
parameter
string COMMENT ‘from deserializer’,
location
string COMMENT ‘from deserializer’,
value
float COMMENT ‘from deserializer’,
unit
string COMMENT ‘from deserializer’,
city
string COMMENT ‘from deserializer’,
attribution
array<struct<name:string,url:string>> COMMENT ‘from deserializer’,
averagingperiod
struct<unit:string,value:float> COMMENT ‘from deserializer’,
coordinates
struct<latitude:float,longitude:float> COMMENT ‘from deserializer’,
country
string COMMENT ‘from deserializer’,
sourcename
string COMMENT ‘from deserializer’,
sourcetype
string COMMENT ‘from deserializer’,
mobile
string COMMENT ‘from deserializer’)
ROW FORMAT SERDE
‘org.openx.data.jsonserde.JsonSerDe’
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘s3://openaq-fetches/realtime-gzipped’
TBLPROPERTIES (
‘transient_lastDdlTime’=’1518373755’)
`
3. Looking for a query to get the last question.
`
SELECT *
FROM
(SELECT city,
unit,
AVG(value) AS AvgO3
FROM "sampledb"."openaq"
WHERE country=’US’
AND parameter=’o3′
AND date.utc LIKE ‘%2018-10-09%’
GROUP BY city, unit) AS AVG_TABLE
WHERE AvgO3 =
(SELECT MAX(AvgO3)
FROM
(SELECT city,
unit,
AVG(value) AS AvgO3
FROM "sampledb"."openaq"
WHERE country=’US’
AND parameter=’o3′
AND date.utc LIKE ‘%2018-10-09%’
GROUP BY city, unit))
`
My result was
1 TULARE ppm 0.056333333
As has been said before
🙂
Result can be differ but overall it was good knowledge and understanding of data flow and result
I got the same answer as you. I used Athena and SQL and got the Create Table code from the OpenAQ website. (But wow, Glue is super awesome! It makes life so much easier!)
I’ve done the same. What’s missing is calculating the average over the whole day. Then the result matches Scott’s answer.