Weather can have a significant impact on user behavior. Many businesses think about the impact a hot summer day can have on their conversion rates. But overlapping effects from seasonality and weather require us to provide exact, session-based weather data. One approach is to extend the web analytics implementation and use a third-party weather service to fetch the information while the user is still on the site. However, these third-party services are not free, it requires us to get the user's location, and it also is an additional load for the site.
Wouldn't it be great if we could just use a public data set and some query magic to get per-session weather data historically and with no additional tracking? Guess what: We can! I prepared a little query that I want to share. But before you expect some full-on statistical analysis, be told that I am no data scientist. I'll get you the data, but someone with more stats skills has to get to the real insights.
The weather data we are using is part of our beloved
bigquery-public-data project. The National Oceanic and Atmospheric Administration (NOAA) makes a dataset available in this project that goes back to 1929 and is updated daily! The weather data is coming from 9000+ weather stations globally.
The GSOD dataset has one table with all the stations that ever recorded a weather measurement and annual tables with the individual measurements. The
stations table has the latitude and longitude information that we'll need later. To find the closest station for every user session, we have to cross join all stations and tables and calculate the geo distance. Then we can join the measurement of that day on to the session and station.
The cross join here is the queries Achilles heel. If we are not filtering the stations to a reasonable amount, the cross join creates a gigantic intermediate result. For me, it worked very well just filtering to a specific country for both the sessions and the stations.
WITH station AS ( SELECT wban, usaf, lon, lat FROM `bigquery-public-data.noaa_gsod.stations` WHERE lon IS NOT NULL AND lat IS NOT NULL AND country = "GM" ), data AS ( SELECT session.fullvisitorid, session.visitId, session.totals.totalTransactionRevenue totalTransactionRevenue, (measure.temp - 32) * (5/9) temp_cel, ST_DISTANCE( ST_GEOGPOINT( SAFE_CAST(session.geoNetwork.longitude AS FLOAT64), SAFE_CAST(session.geoNetwork.latitude AS FLOAT64) ), ST_GEOGPOINT(station.lon, station.lat) ) dist FROM `bigquery-public-data.google_analytics_sample.ga_sessions_2017*` session, station JOIN `bigquery-public-data.noaa_gsod.gsod2017` measure ON station.wban = measure.wban AND station.usaf = measure.stn AND session.geoNetwork.country = "Germany" AND CONCAT(measure.year, measure.mo, measure.da) = session.date ) SELECT fullvisitorId, visitId, totalTransactionRevenue, ARRAY_AGG(temp_cel ORDER BY dist LIMIT 1)[SAFE_OFFSET(0)] temp_cel FROM data GROUP BY 1,2,3
Let's go through this and explain what is going on.
- 26-37: Prepare the station data. The NOAA is using some unusual two-letter country codes. Make sure to adapt it to your country of analysis.
- 54-55: This is the mentioned cross join. Don't let the implicit comma-cross-join confuse you.
- 56-62: We are using a regular join to bring in the measurements and run a couple of filters.
- 45: The temperature data is in Fahrenheit, but I prefer Celcius.
- 46-52: The
ST_DISTANCEfunctions returns the distance between points on earth in meters. The Google Analytics schema represents the latitude and longitude as strings, se have to get a little more verbose here when creating the
- 70: To avoid a correlated query, I had to get a little creative. The use of
ARRAY_AGGallows us to order by
distbut return the
temp_celvalue. We limit to one to get an array with only one member and use
[SAFE_OFFSET(0)]to pull out the one member.
And there we have it: One row per session with the session and user identifier, the temperature in Celcius, and the total revenue. I'm sure you can come up with some ideas on how to use this?